{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第12章 目标检测"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12.1 简介"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "目标检测也是计算机视觉中的一个核心问题，其目的是识别图像中每个目标的类别并且对它们进行定位。定位即确定目标的位置，通常需要勾勒出目标轮廓的边界框（bounding box），确定该边界框的中心点坐标以及宽和高。用数学语言来描述目标检测的问题即是：定义图像空间$\\mathbb{I}$和预定义的类别集合$\\mathcal{C}$，给定数据集$\\mathcal{D} = \\{(\\mathbf{I}^{(n)},\\mathcal{Y}^{(n)})\\}_{n=1}^N$，其中$\\mathbf{I}^{(n)}\\in \\mathbb{I}$是数据集中的第$n$张图像，$\\mathcal{Y}^{(n)}=\\{\\langle c^{(n,m)},\\mathbf{b}^{(n,m)}\\rangle\\}_{m=1}^{M^{(n)}}$是其对应的目标边界框标签，$M^{(n)}$是这张图像中目标的个数，二元组$\\langle c^{(n,m)},\\mathbf{b}^{(n,m)}\\rangle$是这张图像中第$m$个目标的边界框标签，$c^{(n,m)}\\in\\mathcal{C}$是类别标签，$\\mathbf{b}^{(n,m)}=(b_x,b_y,b_h,b_w)^{(n,m)}$是边界框坐标，包括边界框中心点坐标$(b_x,b_y)$以及边界框长宽$(b_h,b_w)$；目标检测的任务是从$\\mathcal{D}$中学习得到一个从图像空间到边界框集合的映射$f:\\mathbb{I} \\rightarrow \\mathcal{Y}$，从而给定任意一张测试图像$\\mathbf{I}$，我们可以用学习得到的映射函数$f$预测该图像中目标边界框集合：$\\hat{\\mathcal{Y}}=f(\\mathbf{I})$。图 12-1 显示了一个目标检测的例子，图中的“狗”、“猫”和“鸭子”被识别出，它们的边界框也被勾勒出。由于目标检测输出的边界框是形状规则的矩形，对下游业务的适用性高，所以在工业领域得到广泛应用。例如辅助驾驶系统中的感知模块的输出就是检测到的目标的边界框。在这一章中，我们将一起学习基于深度学习的目标检测方法，并动手实现相应的目标检测框架。\n",
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic3.zhimg.com/80/v2-94fb384365a9ddd17e05e8a4a6e5fc6a_1440w.jpg\n",
    "\" width=300>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-1 目标检测示例。</div>\n",
    "</center>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12.2 数据集和评测指标"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "在目标检测中，常用的数据集包括Pascal VOC、MS COCO与ImageNet。Pascal VOC和ImageNet数据集在之前的章节中已经介绍过，因此不再赘述。\n",
    "\n",
    "MS COCO（Microsoft Common Objects in Context）数据集是一个广泛用于计算机视觉任务的大型数据集，于2014年由微软发布。它主要用于目标检测、分割、关键点检测和图像描述生成等任务。COCO 数据集的目标是为了推动计算机视觉技术在更复杂的场景中的发展。MS COCO 数据集包含以下特点：\n",
    "\n",
    "1. 大量图像：数据集包含了约20万张标注过的图像，以及40万张未标注的图像。\n",
    "2. 多样性丰富：图像涵盖了80个类别的各种物体，例如动物、交通工具、家具等，同时图像来源丰富，包括从网络上抓取的图片、街景图片和室内图片等。\n",
    "3. 复杂场景：COCO 数据集的图像包含了各种复杂的场景，如拥挤的市场、交通堵塞的道路等，这为计算机视觉模型提供了较高的挑战。\n",
    "\n",
     "目标检测最常用的评价指标是mAP（mean Average Precision）。从mAP的命名可以看出，这是一个定义在每个类别上的精度指标。所以在定义mAP之前，需要先定义一个目标是否被检测到的准则。如图 12-2 所示，对于图像中的一个目标，令其真实的（ground-truth）边界框内部的像素集合为$\\mathcal{A}$，一个预测的边界框（称为检测框）内部的像素集合为$\\mathcal{B}$，通过这两个框内像素集合的IoU来定义两个框的重合度：$\\texttt{IoU}(\\mathcal{A},\\mathcal{B})=\\frac{|\\mathcal{A} \\cap \\mathcal{B}|}{|\\mathcal{A} \\cup \\mathcal{B}|}$，即图中绿色部分的面积除以红色部分的面积。注意这里IoU的定义与上一章语义分割中对IoU的定义在数学上等价，在目标检测中，该IoU也称为Box IoU。给定一个阈值$\\theta$，如果$\\texttt{IoU}(\\mathcal{A},\\mathcal{B})>\\theta$，则认为$\\mathcal{B}$是一个命中目标$\\mathcal{A}$的检测框；反之，则认为检测框并未命中目标$\\mathcal{A}$。根据这个基于IoU的准则，对每个检测框，如果它的预测类别和某个真实边界框的类别一致且它们之间的IoU大于给定阈值，则这个检测框是一个正确检测（true positive，TP）；如果它和任意一个真实边界框的IoU都小于给定阈值，那么这个检测框是一个误检测（false positive，FP）；如果一个真实边界框没有被检测到，那么这是一个漏检测（false negative，FN）。这样对每张图像的检测结果，可以计算出查全率（Recall），即正确检测相对于所有真实边界框的比例，以及查准率（Precision），即正确检测相对于所有检测框的比例：\n",
    "\n",
    "$$\\texttt{Recall} = \\frac{\\texttt{TP}}{\\texttt{TP}+\\texttt{FN}}$$\n",
    "\n",
    "$$\\texttt{Precision} = \\frac{\\texttt{TP}}{\\texttt{TP}+\\texttt{FP}}$$\n",
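     "\n",
     "作为示意，下面用NumPy实现用左上角和右下角坐标表示的两个边界框之间的IoU计算（这只是一个演示性的最小实现，并非本章后续框架的一部分）：\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "def box_iou(box_a, box_b):\n",
     "    # box均为 (x1, y1, x2, y2)，即左上角和右下角坐标\n",
     "    inter_x1 = max(box_a[0], box_b[0])\n",
     "    inter_y1 = max(box_a[1], box_b[1])\n",
     "    inter_x2 = min(box_a[2], box_b[2])\n",
     "    inter_y2 = min(box_a[3], box_b[3])\n",
     "    # 相交区域的面积（两框不相交时为0）\n",
     "    inter = max(inter_x2 - inter_x1, 0) * max(inter_y2 - inter_y1, 0)\n",
     "    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])\n",
     "    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])\n",
     "    return inter / (area_a + area_b - inter)\n",
     "\n",
     "# IoU大于阈值0.5时认为检测框命中该目标\n",
     "iou = box_iou((0, 0, 100, 100), (50, 0, 150, 100))\n",
     "print(iou > 0.5)  # IoU为1/3，未命中\n",
     "```\n",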
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic4.zhimg.com/80/v2-f68491e05d2e55a3d08425cdfe13a317_1440w.webp\n",
    "\" width=300>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-2 两个边界框的重合度通过IoU度量。</div>\n",
    "</center>\n",
    "\n",
     "注意，检测框是否命中真实边界框由IoU阈值决定，评测时通常将IoU阈值固定（如取0.5）。在固定IoU阈值后，将所有检测框按置信度从高到低排序，并逐步调整置信度阈值，每个置信度阈值都对应一对查全率和查准率，由此可以得到一个序列的查全率和查准率对。基于这样一个序列，可以画出一条查准率-查全率（PR）曲线，如图 12-3 所示。这条PR曲线下方所围成的面积称之为AP（Average Precision），而mAP是指各个类别AP的平均值。\n",
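     "\n",
     "作为示意，下面的代码根据一组（查全率，查准率）对近似计算AP。这里采用Pascal VOC风格的做法：先把查准率修正为随查全率单调不增，再按查全率增量累加面积（这只是AP的一种常见近似计算方式，仅作演示）：\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "def average_precision(recalls, precisions):\n",
     "    # recalls: 升序排列的查全率序列；precisions: 对应的查准率序列\n",
     "    recalls = np.concatenate(([0.0], recalls, [1.0]))\n",
     "    precisions = np.concatenate(([1.0], precisions, [0.0]))\n",
     "    # 将查准率修正为单调不增\n",
     "    for i in range(len(precisions) - 2, -1, -1):\n",
     "        precisions[i] = max(precisions[i], precisions[i + 1])\n",
     "    # 按查全率的增量累加矩形面积\n",
     "    ap = 0.0\n",
     "    for i in range(1, len(recalls)):\n",
     "        ap += (recalls[i] - recalls[i - 1]) * precisions[i]\n",
     "    return ap\n",
     "```\n",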
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic4.zhimg.com/80/v2-1c4e1787bbf02f20b10fbf4e30c2eab3_1440w.jpg\n",
    "\" width=300>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
     "    padding: 2px;\">图12-3 查准率-查全率（PR）曲线。曲线下方所围成的面积为AP（Average Precision）。</div>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12.3 目标检测模型——从R-CNN到Faster R-CNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "2012年，Geoffrey Hinton和他的学生提出了AlexNet（见第11章），证明了深度卷积神经网络在图像分类上的突出性能，也开启了计算机视觉的深度学习时代。但是在接下来的一两年里，人们并没有找到能够有效解决目标检测这样一个视觉识别任务的深度模型。这是因为，目标检测过程通常基于一种滑动窗口（sliding window）的策略，即用窗口在图像上滑动，滑动到一个位置，就通过一个分类器去判断该位置窗口内是否存在目标。由于目标大小是未知的，这种滑动窗口策略需要在图像上每个位置尝试多种不同大小不同长宽比的窗口，每张图总共需要尝试的窗口数可能高达数十万甚至数百万。如果用卷积神经网络作为窗口分类器，那么处理一张图像可能需要好几个小时。这种处理速度是根本无法接受的。\n",
    "\n",
     "那么是否有办法提升这种基于滑动窗口的检测速度呢？2014年，Ross Girshick等人给出了解决方案——R-CNN（Region CNN）[1]。他们提出不用遍历所有位置所有尺寸的窗口，而是只需要找到可能存在目标的候选窗口，然后再用卷积神经网络对这些候选窗口做分类就可以了。那么如何找到可能存在目标的候选窗口呢？在2010年左右，有一个问题被提出，叫做类别无关目标检测（object proposal detection）[2]，即不关心目标的类别是什么，只从图像中快速找到可能存在目标的区域。每个区域可能存在目标的置信度也称为objectness。基于类别无关目标检测就可以找到可能存在目标的候选窗口，这些候选窗口的数量远远小于所有滑动窗口的数量，然后再对这些候选窗口用卷积神经网络进行目标类别识别。这便是R-CNN的基本思想。2014年，R-CNN在PASCAL VOC检测数据集上以绝对优势获得第一名，为目标检测树立了一个新的里程碑。随后Girshick等人又对R-CNN做了进一步改进和优化，提出了Fast R-CNN[3]和Faster R-CNN[4]，不但进一步提升了速度，也提升了检测精度。接下来，我们将详细介绍R-CNN系列算法的原理。\n",
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 12.3.1 R-CNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "R-CNN是基于深度学习进行目标检测的开山之作，其流程如图 12-4 所示，可大致分为以下几个步骤：\n",
    "\n",
    "\n",
     "1. 从每张输入图像提取1000～2000个候选区域（RoI，region of interest），即上述的候选窗口；\n",
    "2. 利用深度卷积神经网络（CNN）提取每个候选区域的特征；\n",
    "3. 基于提取的深度卷积特征，利用SVM分类器对每个候选区域进行目标类别识别；\n",
    "4. 基于提取的深度卷积特征，利用坐标回归器修正候选框位置。\n",
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic4.zhimg.com/80/v2-5a8f9899f6809d6b54fca95254f223ab_1440w.webp\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-4 R-CNN流程。</div>\n",
    "</center>\n",
    "\n",
    "R-CNN分为训练和测试两个阶段，这两个阶段大体上遵循上述步骤，但是略有不同。接下来，我们分别详细介绍R-CNN的训练和测试步骤。\n",
    "\n",
     "在训练阶段，首先利用类别无关目标检测方法，如Selective Search[5]，在每张训练图像上提取1000～2000个候选区域。Selective Search是一个经典的类别无关目标检测方法。该方法先通过第9章中介绍的图像分割方法得到图像中的一些分割区域，然后基于区域特征，如颜色、亮度和纹理等，分层次合并区域，从图像中提取出可能包含有目标的候选区域。接着，把所有候选区域归一化成同一空间分辨率，如$227\\times227$，然后再训练一个CNN网络用于提取这些候选区域的特征。训练这个CNN网络一般基于在ImageNet数据集上预训练好的主干网络，如AlexNet、VGG、ResNet等。训练时，需要先对每个候选区域指定其类别标签$u\\in\\mathcal{C}\\cup\\{0\\}$（$0$表示背景）。令第$n$张训练图像$\\mathbf{I}^{(n)}$中的一个候选区域$R$内的像素集合为$\\mathcal{R}$，首先找到与该候选区域$\\mathcal{R}$重合度最高的一个真实边界框：$m^\\ast=\\arg\\max_m\\texttt{IoU}(\\mathcal{R},\\mathcal{B}^{(n,m)})$，其中$\\mathcal{B}^{(n,m)}$表示真实边界框$\\mathbf{b}^{(n,m)}$内的像素集合，该候选区域的类别标签$u$通过如下准则确定\n",
    "\n",
    "$$u = \\left\\{\\begin{matrix}\n",
     "  c^{(n,m^\\ast)} &  {\\rm if}\\ \\texttt{IoU}(\\mathcal{R},\\mathcal{B}^{(n,m^\\ast)})>0.5\\\\\n",
    "  0 & {\\rm otherwise}\n",
    "\\end{matrix}\\right.$$\n",
    "\n",
    "\n",
     "$u=0$说明该候选区域对应背景，因为没有任何一个真实边界框和它的重合度足够大。指定好每个候选区域的类别标签后，便可以基于该指定的类别标签训练CNN网络用于类别分类（训练方法参见第10章中介绍的基于深度学习的图像分类）。然后，利用这个训练好的分类网络提取候选区域的特征。基于CNN网络提取的特征，再对每一个类别训练一个SVM分类器，用于每个候选区域的类别标签预测。除此之外，R-CNN中还使用了边界框坐标的回归器，对候选区域的边界框（候选框）进行位置调整。这里需要对候选框的位置进行调整的原因是候选框的初始位置往往与真实边界框存在偏差。图 12-5 显示了一个直观的例子：红框表示目标猫的真实边界框，黑框为预测的候选框，需要对黑框进行调整，才能使其最终的位置与红框更为接近。具体而言，对上述候选区域$R$，设其候选框坐标为$\\mathbf{r}=(r_x,r_y,r_h,r_w)$，其中$(r_x,r_y)$表示该候选框中心点的坐标，$r_h,r_w$表示该候选框的长和宽，其对应的真实边界框坐标为$\\mathbf{b}=(b_x^{(n,m^\\ast)},b_y^{(n,m^\\ast)},b_h^{(n,m^\\ast)},b_w^{(n,m^\\ast)})$。为了符号的简洁性，在后续的表达中我们省略上标${(n,m^\\ast)}$。那么候选框$\\mathbf{r}$相对于对应的真实边界框$\\mathbf{b}$所要回归的偏移量$\\mathbf{v}=(v_x,v_y,v_h,v_w)$定义为：\n",
    "\n",
    "$$v_x = \\frac{b_x-r_x}{r_w}$$\n",
    "\n",
    "$$v_y = \\frac{b_y-r_y}{r_h}$$\n",
    "\n",
    "$$v_h = \\log\\frac{b_h}{r_h}$$\n",
    "\n",
    "$$v_w = \\log\\frac{b_w}{r_w}$$\n",
    "\n",
     "由上面定义可知，所要回归的偏移量包括候选框与真实边界框中心点的位移$(v_x,v_y)$以及长宽缩放比例的对数$(v_h,v_w)$。该回归任务可以通过一个线性回归器实现。令该候选框经过上述训练好的CNN主干网络所提取的特征为$\\Phi(\\mathbf{r})$，则可以用$\\mathbf{w}^\\mathsf{T}\\Phi(\\mathbf{r})$拟合候选框相对于真实边界框的偏移量$\\mathbf{v}$，其中$\\mathbf{w}$是可学习的参数。该坐标回归任务的损失函数定义为\n",
     "\n",
     "$$ L =  \\|{\\mathbf{w}}^\\mathsf{T}\\Phi(\\mathbf{r}) - \\mathbf{v}\\|^2 + \\|\\mathbf{w}\\|^2$$\n",
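     "\n",
     "上述偏移量的计算及其逆变换（由预测的偏移量恢复边界框坐标）可以用如下示意代码表达（假设边界框用中心点坐标和长宽表示，仅作演示）：\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "def encode_offsets(r, b):\n",
     "    # r: 候选框 (r_x, r_y, r_h, r_w)；b: 真实边界框 (b_x, b_y, b_h, b_w)\n",
     "    r_x, r_y, r_h, r_w = r\n",
     "    b_x, b_y, b_h, b_w = b\n",
     "    # 中心点位移用候选框的长宽归一化，长宽缩放取对数\n",
     "    return ((b_x - r_x) / r_w, (b_y - r_y) / r_h,\n",
     "            np.log(b_h / r_h), np.log(b_w / r_w))\n",
     "\n",
     "def decode_offsets(r, v):\n",
     "    # 由候选框r和偏移量v恢复边界框坐标，是encode_offsets的逆变换\n",
     "    r_x, r_y, r_h, r_w = r\n",
     "    v_x, v_y, v_h, v_w = v\n",
     "    return (v_x * r_w + r_x, v_y * r_h + r_y,\n",
     "            r_h * np.exp(v_h), r_w * np.exp(v_w))\n",
     "```\n",
     "\n",
     "测试阶段正是通过这样的逆变换，把回归器输出的偏移量作用到候选框上，得到修正后的边界框。\n",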
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic3.zhimg.com/80/v2-a000dafdd9cf5cc628cc5a247040cc96_1440w.jpg\n",
    "\" width=300>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-5 候选框（黑色）和真实边界框（红色）。</div>\n",
    "</center>\n",
    "\n",
     "R-CNN测试阶段的步骤和训练阶段类似，给定一张测试图像，同样也是用类别无关目标检测方法提取1000～2000个候选框，然后将它们的空间分辨率归一化到$227\\times227$，接着用训练好的CNN网络提取这些候选框的特征。基于提取的候选框特征，用训练好的SVM分类器和边界框坐标回归器分别预测每个候选框的目标类别标签以及微调其位置坐标。由于候选框的数量非常多，会造成很多误检，所以测试阶段多一个关键步骤——非极大值抑制（non-maximum suppression，NMS），用以去除冗余的候选框：对于每一个类别，先挑选出置信度最大的候选框（该置信度可以是SVM分类器输出的类别概率），设为$\\mathcal{A}$，并计算与其它候选框（如$\\mathcal{B}$）之间的$\\texttt{IoU}(\\mathcal{A},\\mathcal{B})$。若$\\texttt{IoU}(\\mathcal{A},\\mathcal{B})$大于给定的阈值，则将候选框$\\mathcal{B}$删去。重复上述过程，我们便得到了每一个类别得分最高的一些候选区域，如图 12-6 所示。\n",
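     "\n",
     "NMS的过程可以用下面的NumPy代码描述（针对单个类别的简化示意，boxes的每行为左上角和右下角坐标，scores为对应的置信度）：\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "def nms(boxes, scores, iou_thresh):\n",
     "    # boxes: (N, 4)，每行为 (x1, y1, x2, y2)；scores: (N,)\n",
     "    order = np.argsort(scores)[::-1]  # 按置信度从高到低排序\n",
     "    keep = []\n",
     "    while order.size > 0:\n",
     "        i = order[0]  # 当前置信度最高的框，保留\n",
     "        keep.append(int(i))\n",
     "        if order.size == 1:\n",
     "            break\n",
     "        # 计算该框与其余框的IoU\n",
     "        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])\n",
     "        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])\n",
     "        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])\n",
     "        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])\n",
     "        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)\n",
     "        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])\n",
     "        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])\n",
     "        iou = inter / (area_i + areas - inter)\n",
     "        # 只保留与当前框IoU不超过阈值的框，进入下一轮\n",
     "        order = order[1:][iou <= iou_thresh]\n",
     "    return keep\n",
     "```\n",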
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic1.zhimg.com/80/v2-0ba2b8b03f9461a8776f959f83ab0548_1440w.jpg\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-6 利用NMS去除冗余的候选区域。</div>\n",
    "</center>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 12.3.2 Fast R-CNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "R-CNN的思路虽然很直接，但是很明显也存在着许多的问题：1. 对于每张图像中的每个候选区域都需要使用CNN主干网络计算一次特征，而一张图像通常有2000个候选区域，导致计算效率低；2. R-CNN的整个流程分为三个部分（候选框提取、特征提取以及分类器和回归器训练），而SVM分类和坐标回归的结果无法反馈给前端的CNN主干网络用以更新网络参数。为解决这些问题，Ross Girshick于2015年提出了Fast R-CNN [3]。Fast R-CNN的设计更为紧凑，极大提高了目标检测速度。使用AlexNet作为主干网络时，与R-CNN相比，Fast R-CNN训练时间从84小时减少到9.5小时，测试时间从每张图像47秒减少到每张图像0.32秒。\n",
    "\n",
    "\n",
    "Fast R-CNN整体流程如图 12-7 所示，可分为三个步骤：\n",
    "1. 从每张输入图像提取候选区域；\n",
    "2. 利用卷积神经网络计算每张图像的特征图；\n",
    "3. 通过候选区域池化（RoI Pooling）从图像的特征图直接提取每个候选区域的特征，用于候选框目标类别分类与候选框坐标回归。\n",
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic3.zhimg.com/80/v2-c3b7fc34daa7aeaf778e284d2fb1c742_1440w.jpg\n",
    "\" width=600>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-7 Fast R-CNN流程。</div>\n",
    "</center>\n",
    "\n",
    "相较于R-CNN，Fast R-CNN有两个重大改进：1. 不再对每个候选区域单独提取特征，而是在提取整个图像的特征图后，直接从特征图提取每个候选区域的特征，避免了多次用主干网络计算特征的过程；2. Fast R-CNN的候选框特征提取与分类器及回归器的训练是一个端到端的过程，从而分类和回归的结果可以反馈给CNN主干网络用以更新网络参数。\n",
    "\n",
     "实现这两个改进的关键是引入一个候选区域特征提取器，这个特征提取器需要从整张图像的特征图上提取每个候选区域的特征，且不同空间分辨率的候选区域需要具有统一的特征维度，以方便接入后续的分类器和回归器进行训练。Fast R-CNN中提出的候选区域特征提取器叫做候选区域池化（RoI Pooling）。简单来说，RoI Pooling层将每个候选区域均匀分成$M \\times N$块，然后对每块进行池化操作。尽管各候选区域的空间分辨率不同，它们在特征图上对应的映射区域大小也不同，但都可以通过RoI Pooling层产生相同空间分辨率的RoI特征图。那么我们该如何得到图像上的候选区域在特征图中的映射区域呢？\n",
    "\n",
    "在Fast R-CNN中，要找到图像上候选区域在特征图上的映射区域，需要首先了解主干网络（例如ResNet或VGG等）中卷积层和池化层的操作对特征图空间分辨率的影响。例如，一张图像在经过一个步长为2的卷积层或池化层后，输出的特征图的宽度和高度都会减半。\n",
    "\n",
     "假设当前特征图与原始输入图像相比存在一个空间下采样因子（spatial downsampling factor）$S$。例如，在使用VGG-16作为主干网络时，其最后一个卷积层输出的特征图的空间下采样因子$S=16$（因为经过了4次步长为2的池化下采样）。对于给定的原始图像上的候选区域，我们可以通过以下步骤找到特征图上的映射区域：\n",
    "\n",
     "1. 量化原始坐标：首先，量化原始图像上的候选区域的坐标（左上角 $(x_1, y_1)$ 和右下角 $(x_2, y_2)$）。量化是为了将坐标除以空间下采样因子后得到的小数坐标转换为特征图上的整数坐标。我们可以将原始坐标除以空间下采样因子$S$，然后对左上角坐标向下取整、对右下角坐标向上取整，得到量化后的坐标。如图 12-8 所示，通过量化操作便可以将红色虚线框量化为实线框。例如，假设原始候选区域的坐标为 $(x_1, y_1, x_2, y_2) = (50, 50, 200, 200)$，并且空间下采样因子为 $S=16$。那么量化后的坐标为 $(\\lfloor \\frac{50}{16} \\rfloor, \\lfloor \\frac{50}{16} \\rfloor, \\lceil \\frac{200}{16} \\rceil, \\lceil \\frac{200}{16} \\rceil) = (3, 3, 13, 13)$。\n",
    "\n",
    "2. 计算映射区域：根据量化后的坐标，可以找到特征图上与原始图像上候选区域对应的映射区域。在上述例子中，量化后的坐标为 $(3, 3, 13, 13)$，这意味着特征图上的映射区域的左上角坐标为 $(3, 3)$，右下角坐标为 $(13, 13)$。\n",
    "\n",
    "有了特征图上的映射区域，便可以使用 RoI Pooling 层来提取这些区域的特征了。\n",
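     "\n",
     "上述两个步骤可以直接写成代码（沿用正文中空间下采样因子$S=16$、候选区域坐标$(50, 50, 200, 200)$的例子）：\n",
     "\n",
     "```python\n",
     "import math\n",
     "\n",
     "def map_roi_to_feature(x1, y1, x2, y2, s):\n",
     "    # 坐标除以空间下采样因子s，左上角向下取整，右下角向上取整\n",
     "    return (math.floor(x1 / s), math.floor(y1 / s),\n",
     "            math.ceil(x2 / s), math.ceil(y2 / s))\n",
     "\n",
     "print(map_roi_to_feature(50, 50, 200, 200, 16))  # (3, 3, 13, 13)\n",
     "```\n",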
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic4.zhimg.com/80/v2-1276f4a41511f77c0231beddcc3cb2fb_1440w.jpg\n",
    "\" width=400>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-8 对特征图中候选区域的坐标进行量化。</div>\n",
    "</center>\n",
    "\n",
    "\n",
    "该RoI特征图通过两个全连接层后，进行区域的类别分类和区域边界框坐标回归。在训练阶段，对于每个RoI，需要为其指定一个真实类别标签$u$和边界框坐标回归偏移量$\\mathbf{v}$，采用的指定方式与R-CNN中的相同。Fast R-CNN中定义该RoI上的损失函数为：\n",
    "\n",
    "$$L\\left(\\mathbf{q}, u, \\mathbf{t}, \\mathbf{v}\\right)=L_{cls}(\\mathbf{q}, u)+\\lambda\\mathbb{1}_{u\\geq 1}L_{loc}\\left(\\mathbf{t}, \\mathbf{v}\\right)$$\n",
    "\n",
    "其中，$\\mathbf{q}$是Fast R-CNN模型对该RoI区域的类别分类输出，是一个定义在$|\\mathcal{C}|+1$（$+1$表示背景类别）个类别上的概率分布；$\\mathbf{t}$是Fast R-CNN模型预测的该RoI边界框坐标回归的偏移量；$\\mathbb{1}_{u\\geq 1}$是一个指示函数（indicator function），在$u\\geq 1$时返回1，否则返回0，因为当候选框被认为是背景类别时不需要进行候选框的坐标回归。\n",
    "\n",
    "损失函数由两部分组成，一个是分类损失$L_{cls}$，一个是候选框坐标回归损失$L_{loc}$。其中，$L_{cls} = -\\log(q_u)$，$q_u$表示预测的分布$\\mathbf{q}$在类别$u$上的概率值。而$L_{loc}$是平滑1-范数（smooth $L_1$）损失，定义如下：\n",
    "\n",
    "$$L_{loc}(\\mathbf{t}, \\mathbf{v}) = \\sum_{j\\in\\{x,y,w,h\\}} {\\rm smooth}_{L_1}(t_j-{v}_j)$$\n",
    "\n",
    "其中，\n",
    "\n",
    "$${\\rm smooth}_{L_1}(x) = \\left\\{\\begin{matrix}\n",
     "  0.5x^2 &  {\\rm if}\\ |x|<1 \\\\\n",
    "  |x|-0.5 & {\\rm otherwise}\n",
    "\\end{matrix}\\right.$$\n",
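     "\n",
     "平滑$L_1$函数可以用NumPy实现如下（示意实现）：\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "def smooth_l1(x):\n",
     "    # |x|<1时取0.5x^2（二次，梯度平滑），否则取|x|-0.5（线性，对离群值更鲁棒）\n",
     "    x = np.abs(x)\n",
     "    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)\n",
     "```\n",
     "\n",
     "相比平方损失，平滑$L_1$损失在偏差较大时只随偏差线性增长，因此对坐标回归中的离群样本更不敏感。\n",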
    "\n",
    "\n",
    "Fast R-CNN在测试阶段的流程与训练阶段类似，也是包含上述三个步骤。与R-CNN的测试阶段类似，由于生成的候选区域可能包含多个高度重叠的预测边界框，最后需要使用非极大值抑制来合并这些重叠的预测框。\n",
    "\n",
    "接下来，我们来学习下Fast R-CNN中RoI Pooling这一核心模块的代码实现。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
     "import numpy as np\n",
     "\n",
     "def roi_pooling(feature_map, rois, output_size):\n",
     "    # feature_map: 输入特征图，shape为 (H, W, C)\n",
     "    # rois: 包含感兴趣区域坐标的数组，shape为 (num_rois, 4)，每行为 (y1, x1, y2, x2)\n",
     "    # output_size: RoI池化后输出的空间分辨率，为一个长度为2的元组 (output_h, output_w)\n",
     "\n",
     "    # 将RoI坐标转换为整数\n",
     "    rois = np.round(rois).astype(np.int32)\n",
     "\n",
     "    # 初始化输出数组\n",
     "    pooled_rois = np.zeros((rois.shape[0], output_size[0], output_size[1], feature_map.shape[2]))\n",
     "\n",
     "    # 对每个RoI进行池化操作\n",
     "    for i, roi in enumerate(rois):\n",
     "        # 获取RoI的坐标\n",
     "        y1, x1, y2, x2 = roi\n",
     "\n",
     "        # 计算该RoI的垂直和水平池化步长\n",
     "        dy = (y2 - y1) / output_size[0]\n",
     "        dx = (x2 - x1) / output_size[1]\n",
     "\n",
     "        for y in range(output_size[0]):\n",
     "            for x in range(output_size[1]):\n",
     "                # 计算当前池化窗口的坐标，保证每个窗口至少包含一个像素\n",
     "                y_start = int(np.floor(y1 + y * dy))\n",
     "                x_start = int(np.floor(x1 + x * dx))\n",
     "                y_end = max(int(np.ceil(y1 + (y + 1) * dy)), y_start + 1)\n",
     "                x_end = max(int(np.ceil(x1 + (x + 1) * dx)), x_start + 1)\n",
     "\n",
     "                # 取出当前池化窗口对应的特征图区域（包含所有通道）\n",
     "                patch = feature_map[y_start:y_end, x_start:x_end, :]\n",
     "\n",
     "                # 对当前池化窗口做最大池化\n",
     "                pooled_rois[i, y, x, :] = patch.max(axis=(0, 1))\n",
     "\n",
     "    return pooled_rois\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 12.3.3 Faster R-CNN"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "尽管Fast R-CNN相较于R-CNN在测试时间上实现了显著的提升，从每张图像的47秒缩减至0.32秒，达到了146倍的速度提升，但需要注意的是，这个测试时间并未包含候选区域提取的过程。实际上，候选区域提取正是目标检测速度的瓶颈所在。当考虑候选区域提取所需的时间，例如采用Selective Search作为提取方法时，R-CNN和Fast R-CNN的测试时间分别为每张图像50秒和2秒。因此，Fast R-CNN相比R-CNN的速度提升并非146倍，而是25倍。\n",
    "\n",
    "为了进一步提高目标检测的速度，Ross Girshick和Kaiming He等人于2015年提出了Faster R-CNN[4]。该方法通过引入区域生成网络（Region Proposal Network，RPN）实现了高效的候选区域提取，将目标检测的测试时间减少到每张图像0.2秒（包括候选区域提取的过程），相较于R-CNN实现了250倍的速度提升。Faster R-CNN的整体流程如图12-9所示，主要包括以下两个步骤：\n",
    "1. 通过RPN从图像中提取候选区域；\n",
     "2. 基于Fast R-CNN的框架对这些候选区域进行类别分类与边界框坐标回归。\n",
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic2.zhimg.com/80/v2-b84c8709eca08536b3e07598c494d7cd_1440w.jpg\n",
    "\" width=400>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-9 Faster R-CNN流程。</div>\n",
    "</center>\n",
    "\n",
    "相较于Fast R-CNN，Faster R-CNN通过RPN直接从图像的特征图生成若干候选区域，取代了之前的Selective Search方法，实现了端到端的训练模式。简单地说，Faster R-CNN可以被理解为是“RPN + Fast R-CNN”的组合。\n",
    "\n",
     "RPN网络是Faster R-CNN的核心部分，它通过在特征图上的卷积运算替代传统的候选区域生成方法，其框架如图 12-10 所示。首先，将一张图像经过CNN主干网络得到的特征图输入到RPN网络中。RPN假定输入特征图上的每个位置都在原图上对应着$k$个候选区域。以特征图的每一个位置为中心绘制矩形框，即可得到候选区域。为了体现多尺度特性，使用不同大小和不同长宽比的矩形框，这样即使同一物体在图像中发生形变或缩放，仍然存在合适的候选区域能够覆盖该物体。这$k$个候选区域被称作锚框（anchor）。接下来，将这些锚框映射到特征图的每个位置，从而生成大量的候选区域。\n",
    "\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic1.zhimg.com/80/v2-6ac8980a5e6945af0028a2d86ea1310c_1440w.jpg\n",
    "\" width=500>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-10 RPN网络结构。</div>\n",
    "</center>\n",
    "\n",
    "对于每个锚框，RPN网络有两个输出：第一个输出是这个锚框包含物体的可能性以及不包含物体的可能性，所以是两个关于可能性的分数（score）；第二个输出是该锚框对应物体的真实边界框相对于该锚框的坐标偏移量（coordinates），包括中心点坐标偏移量和长宽偏移量，所以RPN网络在特征图上每个位置输出$2k$个可能性分数和$4k$个坐标偏移量。\n",
    "\n",
    "\n",
     "在训练过程中，为了训练RPN网络，需要先给每一个锚框赋予一个真实边界框。这里，将每个锚框与所有的真实边界框进行IoU计算，选择IoU最大的真实边界框与该锚框匹配。接着，根据每一个锚框是前景（即包含需要识别的物体）还是背景对该锚框进行正负样本的划分。对于一张训练图像上的每一个锚框，若该锚框与真实边界框的IoU大于0.7，便认为该锚框中含有物体（即该锚框是正样本）；而当IoU小于0.3时，便认为该锚框中不含物体（即该锚框是负样本）；若IoU介于0.3与0.7之间，则该锚框不参与网络训练的迭代过程。\n",
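     "\n",
     "这一划分规则可以用如下示意代码表达（假设已算出每个锚框与所有真实边界框的IoU矩阵；与后文RPN代码中的约定一致，用1表示正样本、0表示负样本、-1表示不参与训练；这里省略了实际实现中“为每个真实边界框保留IoU最大的锚框作为正样本”的补充规则）：\n",
     "\n",
     "```python\n",
     "import numpy as np\n",
     "\n",
     "def assign_anchor_labels(ious, fg_thresh=0.7, bg_thresh=0.3):\n",
     "    # ious: (num_anchors, num_gt)，每个锚框与每个真实边界框的IoU\n",
     "    max_iou = ious.max(axis=1)  # 每个锚框与最匹配的真实边界框的IoU\n",
     "    labels = np.full(ious.shape[0], -1)  # 默认不参与训练\n",
     "    labels[max_iou > fg_thresh] = 1  # 正样本（前景）\n",
     "    labels[max_iou < bg_thresh] = 0  # 负样本（背景）\n",
     "    return labels\n",
     "```\n",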
    "\n",
    "Faster R-CNN中，对一个锚框而言，RPN的损失函数包括两个部分，一个是分类损失，另一个是坐标回归损失。分类损失主要用于判断每个锚框是否为前景或背景。RPN的分类损失采用的是二分类交叉熵损失函数，公式如下：\n",
    "\n",
     "$$L_{cls} = -[u\\log(q)+(1-u)\\log(1-q)]$$\n",
    "\n",
    "其中，$u$表示该锚框的真实类别，若为前景则$u=1$，否则$u=0$，$q$是该锚框为前景的概率。回归损失主要用于修正前景锚框的边界框坐标。RPN的回归损失采用的是平滑$L_1$损失函数，同Fast R-CNN中使用的相同，公式如下：\n",
     "$$L_{reg} =  \\mathbb{1}_{u\\geq 1}\\sum_{j\\in\\{x,y,h,w\\}} {\\rm smooth}_{L_1}(t_j-v_j)$$\n",
    "\n",
    "其中，$\\mathbb{1}_{u\\geq 1}$表示该锚框是正样本时才计算该损失，$t_j$是该锚框坐标相对于真实边界框坐标偏移量的预测值，$v_j$是其坐标偏移量的真实值，$\\text{smooth}_{L_1}$是平滑函数。总的损失函数为分类损失和回归损失的加权和，公式如下：\n",
    "\n",
    "$$L = \\lambda L_{cls} +(1-\\lambda)L_{reg}$$\n",
    "\n",
    "其中，$\\lambda$是分类损失和回归损失的权重系数，通常取值为0.5。\n",
    "\n",
    "\n",
    "用这些锚框训练RPN网络，并得到RPN网络的输出后，需要对候选区域进行筛选，挑选出一定数量的高质量候选区域，这里一般选择具有较高的前景概率的候选区域。在筛选出高质量的候选区域后，将它们传递给Fast R-CNN，进行更精细的类别预测和坐标回归。这一步骤的目的是从RPN生成的候选区域中识别出物体的具体类别，同时进一步优化候选框的位置和尺寸。\n",
    "\n",
    "\n",
    "Faster R-CNN的测试阶段与训练阶段类似，也是包含了上述两个步骤。由于生成的候选区域可能包含多个高度重叠的预测边界框，最后需要使用非极大值抑制来合并重叠的预测框。\n",
    "\n",
    "\n",
    "我们来看RPN的代码实现中的一些细节。在代码中，RPN网络被定义为RegionProposalNetwork，输入是使用CNN主干网络提取训练图像得到的特征图，输出是该特征图上的候选区域。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 12.3.3.1 RPN代码整体框架"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们先概览RPN代码的整体框架。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class RegionProposalNetwork(torch.nn.Module):\n",
    "    \n",
    "    def __init__(self,\n",
    "                 # 该模块生成anchor \n",
    "                 anchor_generator,\n",
    "                 # 该模块生成置信度objectness和真实边界框相对于anchor的偏移量pred_box_deltas\n",
    "                 head,\n",
     "                 # fg表示foreground即前景（目标），若anchor与gt的iou大于fg_iou_thresh，则被认为目标，默认为0.7\n",
     "                 fg_iou_thresh,\n",
     "                 # bg表示background即背景，若anchor与gt的iou小于bg_iou_thresh，则被认为背景，默认为0.3\n",
     "                 bg_iou_thresh,\n",
    "                 # 训练时需要正负样本平衡，该变量表示每张图像采样batch_size_per_image个样本\n",
    "                 batch_size_per_image,\n",
    "                 # 表示正负样本平衡的正样本比例，正样本数=batch_size_per_image*positive_fraction\n",
    "                 positive_fraction,\n",
    "                 # 在nms前，按置信度排序，最多选取前pre_nms_top_n个proposals送入到nms\n",
    "                 pre_nms_top_n,\n",
    "                 # nms后，按置信度排序，最多选取前post_nms_top_n个proposals送入到roi_head\n",
    "                 post_nms_top_n,\n",
    "                 # nms时，设定的阈值\n",
    "                 nms_thresh\n",
    "    ):\n",
    "        super(RegionProposalNetwork, self).__init__()\n",
    "\n",
    "        self.anchor_generator = anchor_generator\n",
    "        self.head = head\n",
    "        \n",
     "        # box_coder用来对边界框偏移量进行编码和解码\n",
    "        self.box_coder = det_utils.BoxCoder(weights=(1.0, 1.0, 1.0, 1.0))  \n",
    "        \n",
    "        # 用iou来衡量box之间的相似度\n",
    "        self.box_similarity = box_ops.box_iou  \n",
    "        \n",
     "        # 为每个anchor匹配ground truth\n",
    "        self.proposal_matcher = det_utils.Matcher(  \n",
    "            fg_iou_thresh,\n",
    "            bg_iou_thresh,\n",
    "            allow_low_quality_matches=True,\n",
    "        )\n",
    "        \n",
    "        # 正负样本平衡采样，这里进行随机采样，使得正负样本比例满足positive_fraction\n",
    "        self.fg_bg_sampler = det_utils.BalancedPositiveNegativeSampler( \n",
    "            batch_size_per_image, positive_fraction\n",
    "        )\n",
    "        \n",
    "        self._pre_nms_top_n = pre_nms_top_n    \n",
    "        self._post_nms_top_n = post_nms_top_n  \n",
    "        self.nms_thresh = nms_thresh\n",
    "        \n",
     "        # 当proposal的宽或高小于min_size时，去除该proposal\n",
    "        self.min_size = 1e-3  \n",
    "\n",
    "    def pre_nms_top_n(self):\n",
    "        if self.training:\n",
    "            return self._pre_nms_top_n['training']\n",
    "        return self._pre_nms_top_n['testing']\n",
    "\n",
    "    def post_nms_top_n(self):\n",
    "        if self.training:\n",
    "            return self._post_nms_top_n['training']\n",
    "        return self._post_nms_top_n['testing']\n",
    "\n",
     "    @staticmethod\n",
     "    def concat_box_prediction_layers(box_cls, box_regression):\n",
    "        box_cls_flattened = []\n",
    "        box_regression_flattened = []\n",
    "        \n",
    "        for box_cls_per_level, box_regression_per_level in zip(\n",
    "            box_cls, box_regression\n",
    "        ):\n",
    "            N, AxC, H, W = box_cls_per_level.shape\n",
    "            Ax4 = box_regression_per_level.shape[1]\n",
    "            A = Ax4 // 4\n",
    "            C = AxC // A\n",
    "            box_cls_per_level = permute_and_flatten(\n",
    "                 box_cls_per_level, N, A, C, H, W\n",
    "            ) \n",
    "            # 转换为[B,A*H*W,1]，B为Batch_size\n",
    "            box_cls_flattened.append(box_cls_per_level)\n",
    "\n",
    "            box_regression_per_level = permute_and_flatten(\n",
    "                box_regression_per_level, N, A, 4, H, W\n",
    "            ) \n",
    "            # 转换为[B,A*H*W,4]\n",
    "            box_regression_flattened.append(box_regression_per_level)\n",
    "                      \n",
    "        box_cls = torch.cat(box_cls_flattened, dim=1).flatten(0, -2)\n",
    "        box_regression = torch.cat(box_regression_flattened, dim=1).reshape(-1, 4)\n",
    "        \n",
    "        # box_cls:tensor[B*levels*A*H*W,1]，box_regression:tensor[B*levels*A*H*W,4]\n",
    "        return box_cls, box_regression\n",
    "\n",
    "    def forward(self, images, features, targets=None):\n",
    "        # RPN使用所有的feature_maps，注意：roi_head将不使用P6 feature\n",
    "        features = list(features.values())          \n",
    "        \n",
    "        objectness, pred_bbox_deltas = self.head(features)\n",
    "        # objectness为List[tensor[B,1*A,H,W]*levels]，类别的预测概率\n",
    "        # pred_bbox_deltas为List[tensor[B,4*A,H,W]*levels]，候选框相对于真实边界框的偏移量\n",
    "        # levels为FPN的不同尺度特征图的个数，这里就是P2到P6，共5个特征图\n",
    "       \n",
    "        anchors = self.anchor_generator(images, features)\n",
    "       \n",
    "        num_images = len(anchors)\n",
    "        num_anchors_per_level = [o[0].numel() for o in objectness]\n",
     "        objectness, pred_bbox_deltas = self.concat_box_prediction_layers(objectness, pred_bbox_deltas)\n",
    "        # 注意这里每个level的H，W不全相同\n",
    "\n",
    "        # 这里使用pred_bbox_deltas.detach阻断梯度传播，因为Faster R-CNN交替优化，这里阻断了梯度的传播\n",
    "        # 在训练后续网络（Fast R-CNN）时可以保持RPN网络的参数不更新\n",
    "        proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)\n",
    "               \n",
    "        # 先按置信度排序，选最大的前pre_nms_topn个，并对越界的proposals进行剪裁，去除面积太小的proposals，最后进行nms\n",
    "        proposals = proposals.view(num_images, -1, 4)\n",
    "        boxes, scores = self.filter_proposals(proposals, objectness, images.image_sizes, num_anchors_per_level)\n",
    "\n",
    "        losses = {}\n",
    "        # 注意：只有在训练时才进行以下步骤\n",
    "        if self.training:\n",
    "            assert targets is not None\n",
    "            # 计算每个锚框最匹配的gt\n",
    "            labels, matched_gt_boxes = self.assign_targets_to_anchors(anchors, targets)\n",
    "            '''\n",
    "            若该锚框与真实边界框的IoU大于0.7，我们便认为该锚框中含有物体（即该锚框是正样本）；\n",
    "            而当IoU小于0.3时，便认为该锚框中不含物体（即该锚框是负样本）；\n",
    "            若IoU介于0.3-0.7之间时，则不参与网络训练的迭代过程；\n",
    "            我们用0、-1、1表示锚框的类型。0表示背景，1表示目标，-1表示介于背景和目标。\n",
    "            '''\n",
    "            # regression_target为gt相对于锚框的偏移量\n",
    "            regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)\n",
    "        \n",
    "            loss_objectness, loss_rpn_box_reg = self.compute_loss(\n",
    "                objectness, pred_bbox_deltas, labels, regression_targets)\n",
    "            losses = {\n",
    "                \"loss_objectness\": loss_objectness,\n",
    "                \"loss_rpn_box_reg\": loss_rpn_box_reg,\n",
    "            }\n",
    "        return boxes, losses"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "观察forward()函数，我们不难发现其主要包含四个模块：head、anchor_generator、box_coder和filter_proposals，在训练时还需要assign_targets_to_anchors和compute_loss模块。首先，多尺度的特征图经过RPN网络的head结构，输出每个锚框的前景置信度以及候选框坐标偏移量。接着，利用anchor_generator模块生成大小和长宽比不同的锚框。将锚框与偏移量信息输入box_coder模块，解码得到一系列候选区域，再利用filter_proposals筛选出高质量的候选区域传递给下一级网络。下面我们依次进行介绍。\n",
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 12.3.3.2 head模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# head模块在代码里被封装在RPNHead类中\n",
    "class RPNHead(nn.Module):\n",
    "\n",
    "    def __init__(self, in_channels, num_anchors):\n",
    "        super(RPNHead, self).__init__()\n",
    "        self.conv = nn.Conv2d(\n",
    "            in_channels, in_channels, kernel_size=3, stride=1, padding=1\n",
    "        ) \n",
    "        \n",
    "        # 分别进行分类与候选框回归\n",
    "        \n",
     "        # 改变channel数，从in_channels变成num_anchors\n",
     "        # 在图 12-10 中，我们会发现输出的维度是num_anchors*2，这里没有乘2是由于使用的损失函数不同\n",
    "        # 一个使用的是多分类交叉熵损失，一个使用的是二值交叉熵损失\n",
    "        self.cls_logits = nn.Conv2d(in_channels, num_anchors, kernel_size=1, stride=1)\n",
    "        \n",
     "        # 改变channel数，从in_channels变成num_anchors*4\n",
    "        # 这里预测的是预测框相对于锚框的偏移量，在之后会进行解释\n",
    "        self.bbox_pred = nn.Conv2d(\n",
    "            in_channels, num_anchors * 4, kernel_size=1, stride=1\n",
    "        )\n",
    "\n",
    "        # 初始化参数\n",
    "        for l in self.children():\n",
    "            torch.nn.init.normal_(l.weight, std=0.01)\n",
    "            torch.nn.init.constant_(l.bias, 0)\n",
    "\n",
    "    def forward(self, x):\n",
    "        # x为多尺度的特征图\n",
    "        logits = []\n",
    "        bbox_reg = []\n",
    "        for feature in x:\n",
    "            # feature是一个[B,C,H,W]的tensor\n",
    "            # 先经过卷积层，该卷积层不改变feature的尺度\n",
    "            t = F.relu(self.conv(feature))\n",
    "            # tensor[B,A*1,H,W]，A==num_anchor，表示每个grid生成A个anchor\n",
    "            logits.append(self.cls_logits(t))\n",
     "            # tensor[B,A*4,H,W]\n",
    "            bbox_reg.append(self.bbox_pred(t))\n",
    "        return logits, bbox_reg"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "通过代码我们可以知道，Faster R-CNN在每一个栅格（grid）的位置生成$A$个锚框，并对这$A\\times H\\times W$个锚框进行分类和坐标回归，其中$H$和$W$为特征图的高和宽。那么我们该如何生成锚框呢？\n",
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 12.3.3.3 anchor_generator模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# anchor_generator模块被封装在AnchorGenerator类中\n",
    "class AnchorGenerator(nn.Module):\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        # 为不同的特征图设置不一样大小的锚框\n",
    "        sizes=(32, 64, 128, 256, 512),\n",
    "        # 长宽比\n",
    "        aspect_ratios=(0.5, 1.0, 2.0),\n",
    "    ):\n",
    "        super(AnchorGenerator, self).__init__()\n",
    "\n",
    "        if not isinstance(sizes[0], (list, tuple)):\n",
    "            sizes = tuple((s,) for s in sizes) \n",
    "        # 转换为[(32,), (64,), (128,), (256,), (512,)]\n",
    "        \n",
    "        if not isinstance(aspect_ratios[0], (list, tuple)):\n",
    "            aspect_ratios = (aspect_ratios,) * len(sizes) \n",
    "        assert len(sizes) == len(aspect_ratios)\n",
    "\n",
    "        self.sizes = sizes\n",
    "        self.aspect_ratios = aspect_ratios\n",
    "        self.cell_anchors = None\n",
    "        self._cache = {}\n",
    " \n",
    "    def forward(self, image_list, feature_maps):\n",
    "    \n",
    "        # grid_sizes记录feature_maps的高和宽\n",
    "        grid_sizes = list([feature_map.shape[-2:] for feature_map in feature_maps])\n",
    "\n",
    "        # 记录input图像的高和宽\n",
    "        image_size = image_list.tensors.shape[-2:]\n",
    "        \n",
    "        # stride等于图像的尺度除以特征图的尺度\n",
    "        strides = [[int(image_size[0] / g[0]), int(image_size[1] / g[1])] for g in grid_sizes]\n",
    "\n",
    "        dtype, device = feature_maps[0].dtype, feature_maps[0].device\n",
    "        \n",
    "        # 1.设置anchors的H和W，此时anchor中心点都在(0,0)\n",
    "        self.set_cell_anchors(dtype, device)\n",
    "        \n",
    "        # 2.设置anchors的中心点，即把anchor中心的点进行平移\n",
    "        anchors_over_all_feature_maps = self.cached_grid_anchors(grid_sizes, strides)\n",
    "\n",
    "        anchors = torch.jit.annotate(List[List[torch.Tensor]], [])\n",
    "        for i, (image_height, image_width) in enumerate(image_list.image_sizes):\n",
    "            anchors_in_image = []\n",
    "            for anchors_per_feature_map in anchors_over_all_feature_maps:\n",
    "                anchors_in_image.append(anchors_per_feature_map)\n",
    "            anchors.append(anchors_in_image)\n",
    "\n",
    "        anchors = [torch.cat(anchors_per_image) for anchors_per_image in anchors]\n",
    "        # 清除缓存，释放内存\n",
    "        self._cache.clear()\n",
    "        \n",
    "        # anchors: List[tensor[levels*A*H*W,4]*Batch],4即x1,y1,x2,y2（左上角和右下角）,A表示每个grid有多少个检测框\n",
    "        return anchors\n",
    "\n",
    "    # 1.设置anchors的H和W，此时anchor中心点都在(0,0)\n",
    "    def set_cell_anchors(self, dtype, device):\n",
    "        if self.cell_anchors is not None:\n",
    "            return\n",
    "\n",
    "        cell_anchors = [\n",
    "            self.generate_anchors( \n",
    "            # generate_anchors负责给定一组sizes和aspect_ratios，返回一组x1,y1,x2,y2\n",
    "            # 注：此时中心点为(0,0)，即x1+x2=0,y1+y2=0\n",
    "                sizes,\n",
    "                aspect_ratios,\n",
    "                dtype,\n",
    "                device\n",
    "            )\n",
    "            for sizes, aspect_ratios in zip(self.sizes, self.aspect_ratios)\n",
    "        ]\n",
    "        # 为方便之后grid_anchors函数调用，保存在self.cell_anchors\n",
    "        self.cell_anchors = cell_anchors\n",
    "\n",
    "    # 2.设置anchors的中心点，即把anchor中心的点进行平移\n",
    "    def grid_anchors(self, grid_sizes, strides):\n",
    "    \n",
    "        anchors = []\n",
    "        # 导入第一步中的结果\n",
    "        cell_anchors = self.cell_anchors\n",
    "        # 此时第一步set_cell_anchors返回的结果已经设置完长宽，还没有设置中心点\n",
    "        \n",
    "        assert cell_anchors is not None\n",
    "\n",
    "        for size, stride, base_anchors in zip(grid_sizes, strides, cell_anchors):\n",
    "            grid_height, grid_width = size\n",
    "            stride_height, stride_width = stride\n",
    "            device = base_anchors.device\n",
    "            \n",
    "            # 将在后续详解\n",
    "            shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width\n",
    "            shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height\n",
    "\n",
    "            shift_y, shift_x = torch.meshgrid(shifts_y, shifts_x)\n",
    "            shift_x = shift_x.reshape(-1)\n",
    "            shift_y = shift_y.reshape(-1)\n",
    "            shifts = torch.stack((shift_x, shift_y, shift_x, shift_y), dim=1)\n",
    "\n",
    "            # shifts为偏移量：表示anchor中心点从(0,0)移到(shift_x,shift_y)\n",
    "            anchors.append((shifts.view(-1, 1, 4) + base_anchors.view(1, -1, 4)).reshape(-1, 4))\n",
    "\n",
    "        return anchors\n",
    "    \n",
    "    def generate_anchors(self, scales, aspect_ratios, dtype=torch.float32, device=\"cpu\"):\n",
    "        \n",
    "        scales = torch.as_tensor(scales, dtype=dtype, device=device)\n",
    "        # aspect_ratios是长宽比\n",
    "        aspect_ratios = torch.as_tensor(aspect_ratios, dtype=dtype, device=device)\n",
    "\n",
    "        # 当h_ratios=sqrt(aspect_ratios)，w_ratios=1/sqrt(aspect_ratios)时\n",
    "        # h_ratios/w_ratios==aspect_ratios\n",
    "        h_ratios = torch.sqrt(aspect_ratios) \n",
    "        w_ratios = 1 / h_ratios\n",
    "\n",
    "        ws = (w_ratios[:, None] * scales[None, :]).view(-1)\n",
    "        hs = (h_ratios[:, None] * scales[None, :]).view(-1)\n",
    "\n",
    "        # x1,y1,x2,y2（左上角右下角），中心点为（0,0） \n",
    "        base_anchors = torch.stack([-ws, -hs, ws, hs], dim=1) / 2\n",
    "        return base_anchors.round()\n",
    "\n",
    "    def cached_grid_anchors(self, grid_sizes, strides):\n",
    "        key = str(grid_sizes + strides)\n",
    "        if key in self._cache:\n",
    "            return self._cache[key]\n",
    "        anchors = self.grid_anchors(grid_sizes, strides)\n",
    "        self._cache[key] = anchors\n",
    "        return anchors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这一模块中，首先需要设置锚框的高宽以及大小，并在特征图(0，0)点的位置生成一系列锚框，接着平移锚框的中心位置，为每一个栅格生成一系列锚框。\n",
    "\n",
    "在代码的初始化部分中，设置了一系列大小不同的尺度（代码中为sizes）。这里，每一个尺度表示生成锚框的空间分辨率大小，如“size=32”表示生成锚框的空间分辨率大小应该是$32\\times32$。主干网络提取得到的特征是多尺度的，越深层次的特征图包含的语义信息越丰富，感受野也越大，更适合大尺寸目标的检测，因此在网络的最深层设置空间分辨率大的锚框，即$512\\times512$；随着网络层次的变浅，设置的锚框的空间分辨率也依次减小。在实践中，利用ResNet50_with_fpn提取训练图像的特征可以得到5个不同大小的特征图，因此将尺度的大小设置为32，64，128，256，512。aspect_ratio是高宽比，控制着生成锚框的形状。例如当生成锚框的空间分辨率为$32\\times32$、选取的aspect_ratio为0.5时，生成锚框的高和宽约为23和45（面积仍约为$32\\times32$）。通过上述描述，我们可以知道锚框其实是利用特征图和原图像的对应关系在输入图像上进行位置的确定，锚框的坐标是建立在输入图像上的。\n",
    "\n",
    "在(0，0)位置生成锚框之后，需要将锚框的中心点平移到特征图中任意一个位置(x，y)并进行锚框的生成。同时，由于特征图的多尺度性（即每个特征图尺度大小不一），需要为不同尺度的特征图设置不同的中心点。这里的核心代码为："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "shifts_x = torch.arange(0, grid_width, dtype=torch.float32, device=device) * stride_width\n",
    "shifts_y = torch.arange(0, grid_height, dtype=torch.float32, device=device) * stride_height"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "其中，grid_width和grid_height为特征图的宽和高，stride_width和stride_height定义为输入图像的宽（或高）除以特征图的宽（或高）。shifts_x和shifts_y分别是每一次在x方向和y方向平移的长度。我们通过一个简单的例子来进行说明：假设输入图像的尺寸为(8，8)，特征图的尺寸为(2,2)和(4,4)。对于大小为(2,2)的特征图，stride_width和stride_height的大小为4，我们便可以得到shifts_x和shifts_y为[0,4]。也就是说，我们需要以输入图像上的(0,0)，(0,4)，(4,0)，(4,4)这4个位置为中心点，为每个位置生成3个（aspect_ratio为0.5，1，2）形状不同的锚框；而对于大小为(4,4)的特征图，通过计算可以得到shifts_x和shifts_y为[0,2,4,6]，因此我们需要以16个位置为中心点生成锚框。\n",
    "\n",
    "至此，我们已经得到每一个锚框的类别与坐标回归信息。那么，我们该如何将其结合起来，生成候选区域呢？"
   ]
  },
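  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面两段中出现的数字都可以用代码快速验证。下面是一个最小示意，仿照generate_anchors与grid_anchors的计算方式："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "# 1.验证锚框形状：尺度32、aspect_ratio=0.5时，高约23、宽约45\n",
    "scales = torch.tensor([32.])\n",
    "aspect_ratios = torch.tensor([0.5, 1.0, 2.0])\n",
    "h_ratios = torch.sqrt(aspect_ratios)\n",
    "w_ratios = 1 / h_ratios\n",
    "hs = (h_ratios[:, None] * scales[None, :]).view(-1).round()\n",
    "ws = (w_ratios[:, None] * scales[None, :]).view(-1).round()\n",
    "print(torch.stack([hs, ws], dim=1))  # [[23,45],[32,32],[45,23]]\n",
    "\n",
    "# 2.验证中心点：输入图像8x8、特征图2x2时，stride=4\n",
    "stride = 8 // 2\n",
    "shifts = torch.arange(0, 2, dtype=torch.float32) * stride\n",
    "shift_y, shift_x = torch.meshgrid(shifts, shifts)\n",
    "print(torch.stack([shift_x.reshape(-1), shift_y.reshape(-1)], dim=1))\n",
    "# 4个中心点：(0,0),(4,0),(0,4),(4,4)"
   ]
  },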
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 12.3.3.4 box_coder 模块"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在RegionProposalNetwork类的forward()函数里，box_coder出现在以下代码中："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 我们将上一步生成的锚框信息和预测的偏移量输入box_coder的decode函数中，生成候选区域（proposal）\n",
    "proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 计算真实框和锚框之间的偏移关系\n",
    "regression_targets = self.box_coder.encode(matched_gt_boxes, anchors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们先看一看box_coder模块是如何实现的："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class BoxCoder(object):\n",
    "    def __init__(self, weights, bbox_xform_clip=math.log(1000. / 16)):\n",
    "        # type: (Tuple[float, float, float, float], float)\n",
    "\n",
    "        self.weights = weights\n",
    "        self.bbox_xform_clip = bbox_xform_clip\n",
    "        # 在解码时，求预测框的长宽（ph，pw）时需要用到bbox_xform_clip\n",
    "        # (dh,dw)最大为log(img_size_max/anchor_size_min)=log(1000. / 16)\n",
    "    \n",
    "    # decoder模块\n",
    "    #######################################################\n",
    "    def decode(self, rel_codes, boxes):\n",
    "        # type: (Tensor, List[Tensor])\n",
    "        assert isinstance(boxes, (list, tuple))\n",
    "        assert isinstance(rel_codes, torch.Tensor)\n",
    "        boxes_per_image = [b.size(0) for b in boxes]\n",
    "\n",
    "        concat_boxes = torch.cat(boxes, dim=0)\n",
    "        box_sum = 0\n",
    "        for val in boxes_per_image:\n",
    "            box_sum += val\n",
    "        pred_boxes = self.decode_single(rel_codes.reshape(box_sum, -1), concat_boxes)\n",
    "        # rel_codes的shape为(B*levels*A*H*W,4)\n",
    "\n",
    "        return pred_boxes.reshape(box_sum, -1, 4)\n",
    "\n",
    "    def decode_single(self, rel_codes, boxes):\n",
    "        # 根据预测框相对于anchor的偏移量，和anchor，计算出预测框（x1,y1,x2,y2）格式\n",
    "        # 输入：rel_codes即预测框相对于anchor的偏移量，boxes即anchor\n",
    "        boxes = boxes.to(rel_codes.dtype)\n",
    "\n",
    "        widths = boxes[:, 2] - boxes[:, 0]\n",
    "        heights = boxes[:, 3] - boxes[:, 1]\n",
    "        ctr_x = boxes[:, 0] + 0.5 * widths\n",
    "        ctr_y = boxes[:, 1] + 0.5 * heights\n",
    "\n",
    "        wx, wy, ww, wh = self.weights\n",
    "        dx = rel_codes[:, 0::4] / wx\n",
    "        dy = rel_codes[:, 1::4] / wy\n",
    "        dw = rel_codes[:, 2::4] / ww\n",
    "        dh = rel_codes[:, 3::4] / wh\n",
    "\n",
    "        # 这里的clamp是去除大得离谱的预测框，并不能保证预测框完全不越界；无妨，后面filter_proposals还会再做一次越界裁剪\n",
    "        dw = torch.clamp(dw, max=self.bbox_xform_clip)\n",
    "        # bbox_xform_clip详见__init__，设定为log(img_size_max/anchor_size_min),因为dw肯定小于log(img_size_max/anchor_size_min)\n",
    "        dh = torch.clamp(dh, max=self.bbox_xform_clip)\n",
    "\n",
    "        pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]\n",
    "        pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]\n",
    "        pred_w = torch.exp(dw) * widths[:, None]\n",
    "        pred_h = torch.exp(dh) * heights[:, None]\n",
    "       \n",
    "        # 转换为(x1,y1,x2,y2)格式\n",
    "        pred_boxes1 = pred_ctr_x - torch.tensor(0.5, dtype=pred_ctr_x.dtype) * pred_w\n",
    "        pred_boxes2 = pred_ctr_y - torch.tensor(0.5, dtype=pred_ctr_y.dtype) * pred_h\n",
    "        pred_boxes3 = pred_ctr_x + torch.tensor(0.5, dtype=pred_ctr_x.dtype) * pred_w\n",
    "        pred_boxes4 = pred_ctr_y + torch.tensor(0.5, dtype=pred_ctr_y.dtype) * pred_h\n",
    "        pred_boxes = torch.stack((pred_boxes1, pred_boxes2, pred_boxes3, pred_boxes4), dim=2).flatten(1)\n",
    "        return pred_boxes\n",
    "    \n",
    "    #######################################################\n",
    "\n",
    "    # encoder模块\n",
    "    #######################################################\n",
    "    def encode(self, reference_boxes, proposals):\n",
    "        # 这里reference_boxes就是gt框，proposals是anchor锚框\n",
    "        boxes_per_image = [len(b) for b in reference_boxes]\n",
    "        reference_boxes = torch.cat(reference_boxes, dim=0) \n",
    "        proposals = torch.cat(proposals, dim=0)\n",
    "        targets = self.encode_single(reference_boxes, proposals)\n",
    "        return targets.split(boxes_per_image, 0)\n",
    "\n",
    "    def encode_single(self, reference_boxes, proposals):\n",
    "        dtype = reference_boxes.dtype\n",
    "        device = reference_boxes.device\n",
    "        weights = torch.as_tensor(self.weights, dtype=dtype, device=device)\n",
    "        targets = encode_boxes(reference_boxes, proposals, weights) # encode\n",
    "        return targets\n",
    "\n",
    "    @torch.jit.script\n",
    "    def encode_boxes(reference_boxes, proposals, weights):\n",
    "        # 这里reference_boxes就是gt框，proposals是anchor锚框\n",
    "        # 求gt相对于anchor的偏移量，并且anchor和gt都是（x1,y1，x2，y2）格式的需要转换\n",
    "        # type: (torch.Tensor, torch.Tensor, torch.Tensor) -> torch.Tensor\n",
    "        \n",
    "        wx = weights[0]\n",
    "        wy = weights[1]\n",
    "        ww = weights[2]\n",
    "        wh = weights[3]\n",
    "        # 这里就是__init__时的weight向量，到底有什么用呢？\n",
    "        # 为了平衡bbox回归loss和分类loss，避免回归loss远小于分类loss\n",
    "    \n",
    "        proposals_x1 = proposals[:, 0].unsqueeze(1)\n",
    "        proposals_y1 = proposals[:, 1].unsqueeze(1)\n",
    "        proposals_x2 = proposals[:, 2].unsqueeze(1)\n",
    "        proposals_y2 = proposals[:, 3].unsqueeze(1)\n",
    "    \n",
    "        reference_boxes_x1 = reference_boxes[:, 0].unsqueeze(1)\n",
    "        reference_boxes_y1 = reference_boxes[:, 1].unsqueeze(1)\n",
    "        reference_boxes_x2 = reference_boxes[:, 2].unsqueeze(1)\n",
    "        reference_boxes_y2 = reference_boxes[:, 3].unsqueeze(1)\n",
    "    \n",
    "        # 求anchor框（x，y，w，h）格式\n",
    "        ex_widths = proposals_x2 - proposals_x1\n",
    "        ex_heights = proposals_y2 - proposals_y1\n",
    "        ex_ctr_x = proposals_x1 + 0.5 * ex_widths\n",
    "        ex_ctr_y = proposals_y1 + 0.5 * ex_heights\n",
    "\n",
    "        # 求gt框（x，y，w，h）格式\n",
    "        gt_widths = reference_boxes_x2 - reference_boxes_x1\n",
    "        gt_heights = reference_boxes_y2 - reference_boxes_y1\n",
    "        gt_ctr_x = reference_boxes_x1 + 0.5 * gt_widths\n",
    "        gt_ctr_y = reference_boxes_y1 + 0.5 * gt_heights\n",
    "        \n",
    "        # 依据编码公式计算gt相对于anchor的偏移量\n",
    "        targets_dx = wx * (gt_ctr_x - ex_ctr_x) / ex_widths\n",
    "        targets_dy = wy * (gt_ctr_y - ex_ctr_y) / ex_heights\n",
    "        targets_dw = ww * torch.log(gt_widths / ex_widths)\n",
    "        targets_dh = wh * torch.log(gt_heights / ex_heights)\n",
    "    \n",
    "        targets = torch.cat((targets_dx, targets_dy, targets_dw, targets_dh), dim=1)\n",
    "        return targets\n",
    "    \n",
    "    #######################################################\n",
    "    "
   ]
  },
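  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "结合上面的代码，可以把编码与解码的公式整理如下（与encode_boxes和decode_single逐项对应；权重取1时，$(x_a,y_a,w_a,h_a)$为锚框的中心点坐标与宽高，$(x,y,w,h)$为真实框的对应量）。\n",
    "\n",
    "编码，即计算gt相对于锚框的偏移量：\n",
    "\n",
    "$$t_x=\\frac{x-x_a}{w_a},\\quad t_y=\\frac{y-y_a}{h_a},\\quad t_w=\\log\\frac{w}{w_a},\\quad t_h=\\log\\frac{h}{h_a}$$\n",
    "\n",
    "解码，即由预测的偏移量$t_*$恢复预测框：\n",
    "\n",
    "$$\\hat{x}=t_x w_a+x_a,\\quad \\hat{y}=t_y h_a+y_a,\\quad \\hat{w}=w_a e^{t_w},\\quad \\hat{h}=h_a e^{t_h}$$"
   ]
  },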
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如图 12-11 所示，在RPN网络中，通过RPNHead网络可以得到每个栅格A个锚框的前景概率与坐标偏移量信息，通过anchor_generator为每个栅格生成锚框之后，便可以利用偏移量信息得到每一个预测框的坐标信息。在代码中这一过程被称为“解码”，其中$t_*$表示预测的偏移量。同样地，在进行坐标回归时，需要计算真实边界框与锚框的偏移量，在代码中这一过程被称为“编码”。在上述过程中，网络学习和预测的只是4个偏移量参数$t_*$，锚框本身的坐标是固定不变的。锚框之所以被称为“锚”框，正是因为它的位置是固定的。\n",
    "\n",
    "<center>\n",
    "    <img style=\"border-radius: 0.3125em;\" \n",
    "    src=\"https://pic2.zhimg.com/80/v2-2fce689b9eb72e222ce2033ed42e5fe9_1440w.webp\n",
    "\" width=500>\n",
    "    <br>\n",
    "    <div style=\"color:orange; \n",
    "    display: inline-block;\n",
    "    color: #999;\n",
    "    padding: 2px;\">图12-11 RPN网络结构。</div>\n",
    "</center>\n",
    "\n",
    "我们观察box_coder里的decoder部分："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pred_ctr_x = dx * widths[:, None] + ctr_x[:, None]\n",
    "pred_ctr_y = dy * heights[:, None] + ctr_y[:, None]\n",
    "pred_w = torch.exp(dw) * widths[:, None]\n",
    "pred_h = torch.exp(dh) * heights[:, None]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "你是不是发现了？这四行代码正与解码中的四行公式一一对应。\n",
    "\n",
    "除此之外，在RegionProposalNetwork中，我们会发现传入的信息是pred_bbox_deltas.detach()："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "proposals = self.box_coder.decode(pred_bbox_deltas.detach(), anchors)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这是因为Faster R-CNN采用交替训练的方式进行模型的训练，其训练流程如下：\n",
    "\n",
    "1. 初始化Faster R-CNN模型的所有参数，包括CNN、RPN和Fast R-CNN的参数。\n",
    "2. 固定CNN的参数，只训练RPN网络。在训练过程中，根据锚框与真实边界框的匹配结果为RPN构造训练目标。在RPN的训练过程中，同时计算前景/背景二分类和坐标回归的损失。\n",
    "3. 固定RPN网络的参数，只训练Fast R-CNN网络。在训练过程中，使用当前模型的所有参数（包括CNN和RPN的参数）来提取候选框的特征，然后使用这些特征来训练Fast R-CNN网络。在Fast R-CNN网络的训练过程中，同时计算区域分类和坐标回归的损失。\n",
    "4. 重复步骤2和3，交替训练RPN和Fast R-CNN网络，直到模型收敛或达到指定的训练轮数。\n",
    "\n",
    "Faster R-CNN需要分开训练RPN网络和之后的Fast R-CNN网络，即梯度不在两个模块之间互相传播。因此这里对pred_bbox_deltas使用detach()，把候选区域当作固定的输入传给后续网络，使后续网络的损失不会经由候选区域的坐标回传到RPN。\n",
    "\n",
    "在得到候选区域之后，我们需要对其进行过滤，选择合适的候选区域传递给下一级网络。"
   ]
  },
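  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "detach()切断梯度回传的效果可以用一个最小的例子来体会："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "x = torch.tensor([1.0], requires_grad=True)\n",
    "y = x * 2\n",
    "\n",
    "# 不detach：损失可以把梯度回传给x\n",
    "loss1 = (y * 3).sum()\n",
    "loss1.backward(retain_graph=True)\n",
    "print(x.grad)  # tensor([6.])\n",
    "\n",
    "# detach：计算图在此处被切断，loss2不再依赖x\n",
    "loss2 = (y.detach() * 3).sum()\n",
    "print(loss2.requires_grad)  # False"
   ]
  },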
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 12.3.3.5 filter_proposal模块"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "filter_proposal模块对上一步得到的候选区域进行过滤，主要包括以下几步：\n",
    "1. 根据前景分数（score，在代码中为objectness）降序排序候选区域，选择前pre_nms_topn个；\n",
    "2. 对超出图像范围的候选区域进行裁剪；\n",
    "3. 去除面积过小的候选区域；\n",
    "4. 进行nms去除冗余的候选区域；\n",
    "5. 对nms的结果根据置信度进行降序排序，输出前post_nms_topn个候选区域。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    " def filter_proposals(self, proposals, objectness, image_shapes, num_anchors_per_level):\n",
    "        # objectness为tensor[B*levels*H*W*A,1]\n",
    "        # num_anchors_per_level为List([A*H*W])，A为每个grid生成的anchor数量，H、W为该level特征图的size\n",
    "        num_images = proposals.shape[0]\n",
    "        device = proposals.device\n",
    "        \n",
    "        objectness = objectness.detach()  \n",
    "        objectness = objectness.reshape(num_images, -1)\n",
    "\n",
    "        levels = [\n",
    "            torch.full((n,), idx, dtype=torch.int64, device=device)\n",
    "            for idx, n in enumerate(num_anchors_per_level)\n",
    "        ]  \n",
    "        # 记录anchor的所在level\n",
    "        \n",
    "        # (levels*H*W*A,1)，1是该anchor所在的level\n",
    "        levels = torch.cat(levels, 0)\n",
    "        levels = levels.reshape(1, -1).expand_as(objectness)\n",
    "\n",
    "        top_n_idx = self._get_top_n_idx(objectness, num_anchors_per_level)\n",
    "        # 在nms之前先进行过滤\n",
    "        # 根据置信度对同一level特征图产生的proposals进行降序排序，最多选择前pre_nms_topn（默认2000）\n",
    "        # 返回下标索引\n",
    "        image_range = torch.arange(num_images, device=device)\n",
    "        batch_idx = image_range[:, None]\n",
    "\n",
    "        objectness = objectness[batch_idx, top_n_idx]\n",
    "        levels = levels[batch_idx, top_n_idx]\n",
    "        proposals = proposals[batch_idx, top_n_idx]\n",
    "        \n",
    "        final_scores = []\n",
    "        final_boxes = []\n",
    "        for boxes, scores, lvl, img_shape in zip(proposals, objectness, levels, image_shapes):\n",
    "            # clip越界\n",
    "            boxes = box_ops.clip_boxes_to_image(boxes, img_shape) \n",
    "            # 去除面积太小的proposals框\n",
    "            keep = box_ops.remove_small_boxes(boxes, self.min_size)\n",
    "            boxes, scores, lvl = boxes[keep], scores[keep], lvl[keep]\n",
    "            \n",
    "            # 进行nms操作，注意这里在不同level的feature_map上产生的proposal，它们之间独立地进行nms操作。详见batch_nms源码\n",
    "            # 返回的keep是经过置信度降序排序的下标索引\n",
    "            keep = box_ops.batched_nms(boxes, scores, lvl, self.nms_thresh) \n",
    "            \n",
    "            #最多返回前post_nms_topn（默认2000）个proposals，若nms后bbox数量小于post_nms_topn，则全部返回\n",
    "            keep = keep[:self.post_nms_top_n()] \n",
    "            boxes, scores = boxes[keep], scores[keep]\n",
    "            final_boxes.append(boxes)\n",
    "            final_scores.append(scores)\n",
    "        return final_boxes, final_scores\n",
    "\n",
    "    def _get_top_n_idx(self, objectness, num_anchors_per_level):\n",
    "        \n",
    "        r = []\n",
    "        offset = 0\n",
    "        for ob in objectness.split(num_anchors_per_level, 1):\n",
    "            # 对不同level的特征图上的产生的proposals进行独立排序\n",
    "            # ob为tensor(B,H*W*A)\n",
    "            \n",
    "            num_anchors = ob.shape[1]\n",
    "            pre_nms_top_n = min(self.pre_nms_top_n(), num_anchors)\n",
    "            # 对每一张图像的每一level特征图上产生的proposals\n",
    "            # 按置信度降序，选择最大的前pre_nms_top_n个proposals\n",
    "            # 若数量小于pre_nms_top_n则全部返回\n",
    "            \n",
    "            # 记录索引\n",
    "            _, top_n_idx = ob.topk(pre_nms_top_n, dim=1) \n",
    "            r.append(top_n_idx + offset)\n",
    "            offset += num_anchors\n",
    "            \n",
    "            # 返回索引\n",
    "        return torch.cat(r, dim=1) \n",
    "    \n",
    "    \n",
    "def batched_nms(boxes, scores, idxs, iou_threshold):\n",
    "    # boxes即为proposals：tensor[levels*pre_nms_top_n,4]，(x1, y1, x2, y2) 格式\n",
    "    # scores为排序依据，rpn为置信度，roi_head为分类概率\n",
    "    # idx：当在rpn调用时，为proposals所在的level；当在roi_heads调用时，为proposals的类别\n",
    "    \n",
    "    if boxes.numel() == 0:\n",
    "        return torch.empty((0,), dtype=torch.int64, device=boxes.device)\n",
    "    max_coordinate = boxes.max() \n",
    "    # idxs在rpn和roi_head代表不同含义\n",
    "    offsets = idxs.to(boxes) * (max_coordinate + 1)\n",
    "    \n",
    "    # 变化proposals的坐标，对proposal进行平移\n",
    "    boxes_for_nms = boxes + offsets[:, None]\n",
    "    \n",
    "    # 返回的下标索引按置信度降序排序\n",
    "    keep = nms(boxes_for_nms, scores, iou_threshold)\n",
    "    return keep "
   ]
  },
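  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "batched_nms中“坐标平移”这一技巧值得注意：给不同组（level或类别）的框加上各自的大偏移量后，不同组的框在空间上被完全分开、不可能再有交集，因此对全部框做一次nms就等价于每组独立做nms。下面是一个最小示意："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "boxes = torch.tensor([[0., 0., 10., 10.],\n",
    "                      [1., 1., 11., 11.]])  # 两个高度重叠的框\n",
    "idxs = torch.tensor([0, 1])                 # 但它们来自不同的level\n",
    "max_coordinate = boxes.max()\n",
    "offsets = idxs.to(boxes) * (max_coordinate + 1)\n",
    "boxes_for_nms = boxes + offsets[:, None]\n",
    "print(boxes_for_nms)\n",
    "# 第二个框整体平移了12，与第一个框不再有交集，nms时互不抑制"
   ]
  },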
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "需要注意的是，在对上述代码块第一步中生成的候选区域进行操作时，操作的是同一层次（深度）的特征图上的候选区域。换言之，不同层次的特征图产生的候选区域彼此独立。同样，在之后的操作中，也是对同一层次的特征图产生的候选区域进行操作。此外，pre_nms_topn和post_nms_topn在训练和推理时的取值也不同，一般在推理时取1000，训练时取2000。这也很容易理解，模型在推理时需要兼顾速度，因此选取更少的候选框，而在训练时需要训练更多的样本，因此需要选择更多的候选框。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 12.3.3.6 RPN损失函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在RegionProposalNetwork的forward()函数里，还有着assign_targets_to_anchors和compute_loss模块。这两个模块在对RPN模型进行训练时起到损失函数计算的作用。我们先介绍assign_targets_to_anchors模块。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def assign_targets_to_anchors(self, anchors, targets):\n",
    "        labels = []\n",
    "        matched_gt_boxes = []\n",
    "        for anchors_per_image, targets_per_image in zip(anchors, targets):\n",
    "            gt_boxes = targets_per_image[\"boxes\"]\n",
    "            match_quality_matrix = box_ops.box_iou(gt_boxes, anchors_per_image)\n",
    "            # 返回match_quality_matrix为iou矩阵，match_quality_matrix(i,j)表示第i个gt与第j个anchor的iou\n",
    "            # shape的为（M，N），M为gt的个数，N为anchor的个数\n",
    "\n",
    "            matched_idxs = self.proposal_matcher(match_quality_matrix)\n",
    "            # 返回每个anchors匹配的gt\n",
    "            # 小于low_threshold(即背景)为-1；介于low_threshold和high_threshold之间(抛弃)为-2；\n",
    "            # 大于high_threshold(即目标)为匹配的gt的下标索引\n",
    "\n",
    "            matched_gt_boxes_per_image = gt_boxes[matched_idxs.clamp(min=0)]\n",
    "            # clamp操作把负数变0，所以matched_gt_boxes_per_image为所有被认为是目标的anchors对应的最大iou的gt\n",
    "\n",
    "            labels_per_image = matched_idxs >= 0\n",
    "            labels_per_image = labels_per_image.to(dtype=torch.float32)\n",
    "\n",
    "            # 认为是背景的anchor下标\n",
    "            bg_indices = matched_idxs == self.proposal_matcher.BELOW_LOW_THRESHOLD # 默认-1\n",
    "            labels_per_image[bg_indices] = torch.tensor(0.0)\n",
    "\n",
    "            # 被忽视的anchor下标\n",
    "            inds_to_discard = matched_idxs == self.proposal_matcher.BETWEEN_THRESHOLDS # 默认-2\n",
    "            labels_per_image[inds_to_discard] = torch.tensor(-1.0)\n",
    "\n",
    "            labels.append(labels_per_image)\n",
    "            matched_gt_boxes.append(matched_gt_boxes_per_image)\n",
    "        # labels:list(labels_per_image)\n",
    "        # labels:最后取值为0、-1、1。 0表示背景，1表示目标，-1表示表示其他，不参与训练\n",
    "        return labels, matched_gt_boxes\n",
    "\n",
    "    \n",
    "# 计算 box iou\n",
    "def box_iou(boxes1, boxes2):\n",
    "    \n",
    "    area1 = box_area(boxes1)\n",
    "    area2 = box_area(boxes2)\n",
    "\n",
    "    lt = torch.max(boxes1[:, None, :2], boxes2[:, :2]) \n",
    "    # 计算iou左上角\n",
    "    # lt的shape为(N,M,2)，lt[n][m]表示boxes1的第n个box和boxes2的第m个box的交集的左上角坐标\n",
    "\n",
    "    rb = torch.min(boxes1[:, None, 2:], boxes2[:, 2:])\n",
    "    # 计算iou右下角,方法同上\n",
    "    # rb的shape为(N,M,2)，rb[n][m]表示boxes1的第n个box和boxes2的第m个box的交集的右下角坐标\n",
    "\n",
    "    wh = (rb - lt).clamp(min=0)  # 如图2\n",
    "    inter = wh[:, :, 0] * wh[:, :, 1]  \n",
    "    # inter的shape为(N,M),inter[n][m]表示boxes1中第n个box与boxes2中第m个box的交集面积\n",
    "\n",
    "    # iou等于交集/并集\n",
    "    iou = inter / (area1[:, None] + area2 - inter)  \n",
    "    return iou"
   ]
  },
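  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以用两个简单的框手动验证box_iou的计算逻辑（最小示意）：两个$4\\times4$的框错开2个像素，交集为$2\\times2=4$，并集为$16+16-4=28$，iou为$4/28\\approx0.143$。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "a = torch.tensor([[0., 0., 4., 4.]])\n",
    "b = torch.tensor([[2., 2., 6., 6.]])\n",
    "\n",
    "lt = torch.max(a[:, None, :2], b[:, :2])  # 交集左上角(2,2)\n",
    "rb = torch.min(a[:, None, 2:], b[:, 2:])  # 交集右下角(4,4)\n",
    "wh = (rb - lt).clamp(min=0)\n",
    "inter = wh[..., 0] * wh[..., 1]           # 交集面积4\n",
    "area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])\n",
    "area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])\n",
    "iou = inter / (area_a[:, None] + area_b - inter)\n",
    "print(iou)  # tensor([[0.1429]])"
   ]
  },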
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这一步中，给每一个锚框添加了真实类别信息。若该锚框与真实边界框的IoU大于0.7，便认为该锚框为正样本，标签为1；当IoU小于0.3时，便认为该锚框是负样本，标签为0；除此之外的锚框标签为-1，不参与网络训练的迭代过程。同时，需要为每一个锚框找到与之最匹配的真实框，这里使用proposal_matcher进行匹配：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Matcher(object):\n",
    "    \"\"\"\n",
    "    实现锚框与真实边界框的配对，为每一个锚框匹配一个box iou最大的真实边界框。\n",
    "    匹配完成之后，比较box iou与阈值的关系，判断该锚框的label （0，1，-1）\n",
    "    \"\"\"\n",
    "    def __init__(self, high_threshold, low_threshold, allow_low_quality_matches=False):\n",
    "\n",
    "        self.BELOW_LOW_THRESHOLD = -1\n",
    "        self.BETWEEN_THRESHOLDS = -2 \n",
    "        # 这两个取值必须小于0，因为索引从0开始\n",
    "        \n",
    "        assert low_threshold <= high_threshold\n",
    "        self.high_threshold = high_threshold\n",
    "        \"\"\"\n",
    "        allow_low_quality_matches (bool):\n",
    "            如果值为真，允许锚框匹配上小iou的真实边界框，因为可能有一个真实边界框与所有的锚框之间的iou都小于阈值\n",
    "            为了让每个真实边界框都有与之配对的锚框，则配对该真实边界框和与之iou最大锚框\n",
    "        \"\"\"\n",
    "        self.low_threshold = low_threshold\n",
    "        self.allow_low_quality_matches = allow_low_quality_matches\n",
    "\n",
    "    def __call__(self, match_quality_matrix):\n",
    "        \n",
    "        # 给定anchor，找与之iou最大的gt，mathed_vals为每列的最大值，mathes为每列最大值的索引\n",
    "        matched_vals, matches = match_quality_matrix.max(dim=0)\n",
    "        \n",
    "        if self.allow_low_quality_matches:\n",
    "            all_matches = matches.clone()\n",
    "            # 临时值，在剔除iou低的索引前先保存一下，以便后面set_low_quality_matches_()恢复matches\n",
    "        else:\n",
    "            all_matches = None\n",
    "\n",
    "        below_low_threshold = matched_vals < self.low_threshold\n",
    "        between_thresholds = (matched_vals >= self.low_threshold) & (matched_vals < self.high_threshold)\n",
    "        \n",
    "        # 选出背景\n",
    "        matches[below_low_threshold] = torch.tensor(self.BELOW_LOW_THRESHOLD) \n",
    "        # 选出背景与目标之间\n",
    "        matches[between_thresholds] = torch.tensor(self.BETWEEN_THRESHOLDS)   \n",
    "\n",
    "        if self.allow_low_quality_matches:\n",
    "            assert all_matches is not None\n",
    "            self.set_low_quality_matches_(matches, all_matches, match_quality_matrix)\n",
    "            # 给定gt，与之对应的最大iou的anchor，即便iou小于阈值，也把它作为目标\n",
    "\n",
    "        return matches\n",
    "\n",
    "    def set_low_quality_matches_(self, matches, all_matches, match_quality_matrix):\n",
    "\n",
    "        # 对每一个真实边界框，找到一个与之iou最大的锚框\n",
    "        highest_quality_foreach_gt, _ = match_quality_matrix.max(dim=1)\n",
    "        gt_pred_pairs_of_highest_quality = torch.nonzero(\n",
    "            match_quality_matrix == highest_quality_foreach_gt[:, None]\n",
    "        )\n",
    "        # Example gt_pred_pairs_of_highest_quality:\n",
    "        #   tensor([[    0, 39796],\n",
    "        #           [    1, 32055],\n",
    "        #           [    1, 32070],\n",
    "        #           [    2, 39190],\n",
    "        #           [    2, 40255],\n",
    "        #           [    3, 40390],\n",
    "        #           [    3, 41455],\n",
    "        #           [    4, 45470],\n",
    "        #           [    5, 45325],\n",
    "        #           [    5, 46390]])\n",
    "        # Each row is a (gt index, prediction index)\n",
    "        # Note gt items 1, 2, 3, and 5 each have two ties\n",
    "        # 通过上面例子，发现给定gt，如果与之对应的最大iou的pred_box有多个，则都认为是目标\n",
    "        pred_inds_to_update = gt_pred_pairs_of_highest_quality[:, 1]\n",
    "        matches[pred_inds_to_update] = all_matches[pred_inds_to_update] "
   ]
  },
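  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Matcher的匹配行为可以用一个小的iou矩阵直观地看到（最小示意，阈值取RPN默认的high_threshold=0.7、low_threshold=0.3）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "# 2个gt、4个anchor的iou矩阵\n",
    "iou = torch.tensor([[0.8, 0.1, 0.5, 0.2],\n",
    "                    [0.1, 0.6, 0.4, 0.1]])\n",
    "matched_vals, matches = iou.max(dim=0)  # 每个anchor匹配与之iou最大的gt\n",
    "matches[matched_vals < 0.3] = -1        # 背景\n",
    "matches[(matched_vals >= 0.3) & (matched_vals < 0.7)] = -2  # 忽略\n",
    "print(matches)  # tensor([ 0, -2, -2, -1])\n",
    "# 注意gt1没有匹配到任何anchor；若allow_low_quality_matches=True，\n",
    "# 与gt1的iou最大的anchor1会被恢复为正样本"
   ]
  },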
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这里还有一个细节需要我们注意：RPN网络生成的锚框中正样本与负样本的比例并不均衡，一般负样本数量远多于正样本。因此，在训练时需要平衡两者的数量。可以通过随机采样的方式进行二者比例的控制，具体实现如下：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class BalancedPositiveNegativeSampler(object):\n",
    "    \n",
    "    def __init__(self, batch_size_per_image, positive_fraction):\n",
    "        # type: (int, float)\n",
    "        self.batch_size_per_image = batch_size_per_image\n",
    "        self.positive_fraction = positive_fraction\n",
    "\n",
    "    def __call__(self, matched_idxs):\n",
    "        # type: (List[Tensor])\n",
    "        \"\"\"\n",
    "        输入：matched_idx的取值：0为背景(负样本)，-1为介于背景和目标之间，>0为目标(正样本)\n",
    "        返回：pos_idx和neg_idx，分别记录正样本和负样本\n",
    "        \"\"\"\n",
    "        pos_idx = []\n",
    "        neg_idx = []\n",
    "        for matched_idxs_per_image in matched_idxs:\n",
    "            # 0为背景(负样本)，-1为介于背景和目标之间，>0为目标(正样本)\n",
    "            positive = torch.nonzero(matched_idxs_per_image >= 1).squeeze(1)\n",
    "            negative = torch.nonzero(matched_idxs_per_image == 0).squeeze(1)\n",
    "\n",
    "            # 计划采样的正样本数量\n",
    "            num_pos = int(self.batch_size_per_image * self.positive_fraction) \n",
    "            num_pos = min(positive.numel(), num_pos)\n",
    "\n",
    "            # 计划采样的负样本数量\n",
    "            num_neg = self.batch_size_per_image - num_pos \n",
    "            num_neg = min(negative.numel(), num_neg)\n",
    "\n",
    "            # 随机采样\n",
    "            perm1 = torch.randperm(positive.numel(), device=positive.device)[:num_pos]\n",
    "            perm2 = torch.randperm(negative.numel(), device=negative.device)[:num_neg]\n",
    "\n",
    "            pos_idx_per_image = positive[perm1]\n",
    "            neg_idx_per_image = negative[perm2]\n",
    "            \n",
    "            # 生成索引mask\n",
    "            pos_idx_per_image_mask = torch.zeros_like(\n",
    "                matched_idxs_per_image, dtype=torch.uint8\n",
    "            )\n",
    "            neg_idx_per_image_mask = torch.zeros_like(\n",
    "                matched_idxs_per_image, dtype=torch.uint8\n",
    "            )\n",
    "\n",
    "            pos_idx_per_image_mask[pos_idx_per_image] = torch.tensor(1, dtype=torch.uint8)\n",
    "            neg_idx_per_image_mask[neg_idx_per_image] = torch.tensor(1, dtype=torch.uint8)\n",
    "\n",
    "            pos_idx.append(pos_idx_per_image_mask)\n",
    "            neg_idx.append(neg_idx_per_image_mask)\n",
    "\n",
    "        return pos_idx, neg_idx"
   ]
  },
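  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "采样数量的控制逻辑可以用一组标签来验证（最小示意，假设batch_size_per_image=4、positive_fraction=0.5）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "labels = torch.tensor([1., 0., 0., 0., 0., -1., 1., 0.])\n",
    "positive = torch.nonzero(labels >= 1).squeeze(1)  # 2个正样本\n",
    "negative = torch.nonzero(labels == 0).squeeze(1)  # 5个负样本\n",
    "\n",
    "batch_size_per_image, positive_fraction = 4, 0.5\n",
    "num_pos = min(positive.numel(), int(batch_size_per_image * positive_fraction))\n",
    "num_neg = min(negative.numel(), batch_size_per_image - num_pos)\n",
    "print(num_pos, num_neg)  # 2 2：正负样本数量被控制为1:1"
   ]
  },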
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在明白了上述流程之后，我们理解compute_loss模块也就会比较容易了。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_loss(self, objectness, pred_bbox_deltas, labels, regression_targets):\n",
    "\n",
    "        sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)\n",
    "        # fg_bg_sampler将调用BalancedPositiveNegativeSampler\n",
    "        # 目标检测的特点是负样本数量远大于正样本数量，需要通过BalancedPositiveNegativeSampler平衡正负样本\n",
    "\n",
    "        sampled_pos_inds = torch.nonzero(torch.cat(sampled_pos_inds, dim=0)).squeeze(1)\n",
    "        sampled_neg_inds = torch.nonzero(torch.cat(sampled_neg_inds, dim=0)).squeeze(1)\n",
    "\n",
    "        sampled_inds = torch.cat([sampled_pos_inds, sampled_neg_inds], dim=0)\n",
    "\n",
    "        objectness = objectness.flatten()\n",
    "\n",
    "        labels = torch.cat(labels, dim=0)\n",
    "        regression_targets = torch.cat(regression_targets, dim=0)\n",
    "\n",
    "        # 计算pred_bbox_deltas（预测框相对于anchor的偏移量）与regression_targets（gt相对于anchor的偏移量）的L1 loss\n",
    "        box_loss = F.l1_loss(  \n",
    "            pred_bbox_deltas[sampled_pos_inds],\n",
    "            regression_targets[sampled_pos_inds],\n",
    "            reduction=\"sum\",\n",
    "        ) / (sampled_inds.numel())\n",
    "\n",
    "        # 计算置信度与label的之间交叉熵损失函数，label取值为0,1 \n",
    "        objectness_loss = F.binary_cross_entropy_with_logits(\n",
    "            objectness[sampled_inds], labels[sampled_inds]\n",
    "        )\n",
    "\n",
    "        return objectness_loss, box_loss\n"
   ]
  },
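  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "两个损失项的计算形式可以用toy张量演示（最小示意，数值均为假设）："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn.functional as F\n",
    "\n",
    "objectness = torch.tensor([2.0, -1.0, 0.5])  # 3个采样anchor的前景logit\n",
    "labels = torch.tensor([1.0, 0.0, 1.0])       # 对应的正负样本标签\n",
    "cls_loss = F.binary_cross_entropy_with_logits(objectness, labels)\n",
    "\n",
    "pred_deltas = torch.tensor([[0.1, 0.2, 0.0, 0.1]])    # 正样本的预测偏移量\n",
    "target_deltas = torch.tensor([[0.0, 0.2, 0.1, 0.1]])  # gt相对anchor的偏移量\n",
    "box_loss = F.l1_loss(pred_deltas, target_deltas, reduction='sum') / 3  # 除以采样总数\n",
    "print(cls_loss.item(), box_loss.item())"
   ]
  },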
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在学习完上述所有内容之后，建议读者重新细读一遍RegionProposalNetwork的代码，是不是觉得RPN网络的工作流程十分清晰了呢？"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 12.3.3.7 Faster R-CNN的数据转换"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在训练图像输入主干网络提取特征图之前，我们还需要对图像进行变换，以增强模型的鲁棒性。在Faster R-CNN中，会对输入的训练图像进行以下三步操作：\n",
    "1. normalize，即对图像进行标准化；\n",
    "2. resize，即对图像进行缩放，需要对图像和真实边界框同时进行缩放；\n",
    "3. batch_image，即将一个批次的图像统一进入tensor张量中。\n",
    "\n",
    "这三个模块全部定义在GeneralizedRCNNTransform类中：\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class GeneralizedRCNNTransform(nn.Module):\n",
    "\n",
    "    def __init__(self, min_size, max_size, image_mean, image_std):\n",
    "        super(GeneralizedRCNNTransform, self).__init__()\n",
    "        if not isinstance(min_size, (list, tuple)):\n",
    "            min_size = (min_size,)\n",
    "        self.min_size = min_size \n",
    "        #  min_size是list或tuple，training时，会随机从中选取一个size，testing时固定选择最后一个size\n",
    "        self.max_size = max_size\n",
    "        self.image_mean = image_mean\n",
    "        self.image_std = image_std\n",
    "\n",
    "    def forward(self, images, targets=None):\n",
    "        # type: (List[Tensor], Optional[List[Dict[str, Tensor]]])\n",
    "        images = [img for img in images] \n",
    "        #  深拷贝，这样就不会改变原来的images\n",
    "        for i in range(len(images)):\n",
    "            image = images[i]\n",
    "            target_index = targets[i] if targets is not None else None\n",
    "            if image.dim() != 3:\n",
    "                raise ValueError(\"images is expected to be a list of 3d tensors \"\n",
    "                                 \"of shape [C, H, W], got {}\".format(image.shape))\n",
    "\n",
    "            image = self.normalize(image) \n",
    "            #减去mean，除以std\n",
    "\n",
    "            image, target_index = self.resize(image, target_index)  \n",
    "            # resize同时对image和target操作，target记录着groud-truth bbox所以图片缩放，gt也要跟着缩放\n",
    "            # training时，最终size大小不确定，testing时resize的大小是确定的\n",
    "  \n",
    "            images[i] = image\n",
    "            if targets is not None and target_index is not None:\n",
    "                targets[i] = target_index\n",
    "\n",
    "        image_sizes = [img.shape[-2:] for img in images]  # 记录resize之后，batch-images操作之前的img_size\n",
    "        images = self.batch_images(images) #  batch_images操作，使得一个batch中的图片统一到一个tensor中\n",
    "        image_sizes_list = torch.jit.annotate(List[Tuple[int, int]], [])\n",
    "        for image_size in image_sizes:\n",
    "            assert len(image_size) == 2\n",
    "            image_sizes_list.append((image_size[0], image_size[1]))\n",
    "\n",
    "        image_list = ImageList(images, image_sizes_list)\n",
    "        return image_list, targets\n",
    "    \n",
    "    def normalize(self, image):\n",
    "        dtype, device = image.dtype, image.device\n",
    "        mean = torch.as_tensor(self.image_mean, dtype=dtype, device=device)\n",
    "        std = torch.as_tensor(self.image_std, dtype=dtype, device=device)\n",
    "        return (image - mean[:, None, None]) / std[:, None, None]\n",
    "        \n",
    "    def resize(self, image, target):\n",
    "        h, w = image.shape[-2:]\n",
    "        im_shape = torch.tensor(image.shape[-2:])\n",
    "        min_size = float(torch.min(im_shape))\n",
    "        max_size = float(torch.max(im_shape))\n",
    "\n",
    "        if self.training:\n",
    "            size = float(self.torch_choice(self.min_size))\n",
    "            # training时，会随机选取一个size\n",
    "            # min_size是为list，torch_choice返回一个随机整数[0,len(min_size)]，即随机选min_size的下标值\n",
    "\n",
    "        else:\n",
    "            size = float(self.min_size[-1])\n",
    "            #当testing时，固定resize大小\n",
    "\n",
    "        scale_factor = size / min_size\n",
    "        if max_size * scale_factor > self.max_size: #确保小于max_size\n",
    "            scale_factor = self.max_size / max_size\n",
    "\n",
    "        # 用插值法缩放图片，要求输入为4维，image[None]表示增加一维，(B,H,W)->(1,B,H,W)\n",
    "        image = torch.nn.functional.interpolate(\n",
    "            image[None], scale_factor=scale_factor, mode='bilinear', align_corners=False)[0]\n",
    "\n",
    "        if target is None:\n",
    "            return image, target\n",
    "\n",
    "        bbox = target[\"boxes\"]\n",
    "        bbox = resize_boxes(bbox, (h, w), image.shape[-2:])  # 缩放bbox\n",
    "        target[\"boxes\"] = bbox\n",
    "        ………………\n",
    "        return image, target\n",
    "\n",
    "    def resize_boxes(boxes, original_size, new_size):\n",
    "        # type: (Tensor, List[int], List[int])\n",
    "        ratios = [float(s) / float(s_orig) for s, s_orig in zip(new_size, original_size)]\n",
    "        ratio_height, ratio_width = ratios\n",
    "        xmin, ymin, xmax, ymax = boxes.unbind(1)\n",
    "\n",
    "        xmin = xmin * ratio_width\n",
    "        xmax = xmax * ratio_width\n",
    "        ymin = ymin * ratio_height\n",
    "        ymax = ymax * ratio_height\n",
    "        return torch.stack((xmin, ymin, xmax, ymax), dim=1)\n",
    "\n",
    "    def torch_choice(self, l):       \n",
    "        index = int(torch.empty(1).uniform_(0., float(len(l))).item()) # 返回int[0,len(l)]的随机数\n",
    "        return l[index]\n",
    "    \n",
    "    def batch_images(self, images, size_divisible=32):\n",
    "        # type: (List[Tensor], int)\n",
    "        # 要求最后的图片tensor.shape能被size_divisible整除，而resnet50的最后feature_map_size就是=input_img_size/32\n",
    "       \n",
    "        ………………\n",
    "        # 计算最大的channel,H,W\n",
    "        max_size = self.max_by_axis([list(img.shape) for img in images])\n",
    "\n",
    "        stride = float(size_divisible)\n",
    "        max_size = list(max_size)\n",
    "\n",
    "        # 确保max_size能被size_divisible整除\n",
    "        max_size[1] = int(math.ceil(float(max_size[1]) / stride) * stride)\n",
    "        max_size[2] = int(math.ceil(float(max_size[2]) / stride) * stride)\n",
    "\n",
    "        # +是append操作，batch_shape:(B,C,H,W)\n",
    "        batch_shape = [len(images)] + max_size\n",
    "        batched_imgs = images[0].new_full(batch_shape, 0)\n",
    "        for img, pad_img in zip(images, batched_imgs):\n",
    "            # 所有图像都从左上角开始对齐，剩下部分即为0\n",
    "            pad_img[: img.shape[0], : img.shape[1], : img.shape[2]].copy_(img)\n",
    "\n",
    "        return batched_imgs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们观察forward()函数中的流程，对一张图像，先进行标准化操作，将每一张图像的像素约束在一个合理的区间内；接着，对图像与其真实边界框进行缩放。在训练时，对于每一张图像会随机选择一个缩放的比例，利用双线性插值法对图像进行缩放，而在测试时则固定缩放比例。由于训练时每一张图像的大小都不一样，因此在此之后需要进行批处理，将其调整为一致的大小。这里统计每一个批次中最大图像的宽$W$和高$H$，将所有图像的左上角对齐，通过补零的方式将每张图像的大小调整为$W$和$H$。"
   ]
  },
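  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上述缩放与填充中的尺寸计算可以用一个数值例子来验证。下面是一个纯Python的极简示意（compute_scale、padded_size为本例自拟的函数名；假设min_size=800、max_size=1333、size_divisible=32，这组数值与torchvision Faster R-CNN的默认配置一致）：\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def compute_scale(h, w, target_min=800, target_max=1333):\n",
    "    # 先尝试把短边缩放到target_min\n",
    "    scale = target_min / min(h, w)\n",
    "    # 若此时长边会超过target_max，则改为把长边缩放到target_max\n",
    "    if max(h, w) * scale > target_max:\n",
    "        scale = target_max / max(h, w)\n",
    "    return scale\n",
    "\n",
    "def padded_size(h, w, size_divisible=32):\n",
    "    # 宽高分别向上取整到size_divisible的整数倍\n",
    "    return (int(math.ceil(h / size_divisible) * size_divisible),\n",
    "            int(math.ceil(w / size_divisible) * size_divisible))\n",
    "\n",
    "print(round(compute_scale(375, 500), 3))   # 2.133，短边375缩放到800\n",
    "print(round(compute_scale(500, 1000), 3))  # 1.333，受max_size=1333限制\n",
    "print(padded_size(800, 1067))              # (800, 1088)\n",
    "```"
   ]
  },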
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12.4 Faster R-CNN代码实现"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于Faster R-CNN代码架构过于庞大，我们将直接调用相关接口进行效果展示。我们先导入必要的包，并使用在MS COCO数据集上预训练的Resnet50作为我们的主干网络。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-01-10T06:28:26.010951Z",
     "iopub.status.busy": "2023-01-10T06:28:26.010181Z",
     "iopub.status.idle": "2023-01-10T06:28:27.260312Z",
     "shell.execute_reply": "2023-01-10T06:28:27.259065Z",
     "shell.execute_reply.started": "2023-01-10T06:28:26.010824Z"
    }
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import functools\n",
    "import matplotlib.pyplot as plt\n",
    "import cv2\n",
    "import torch\n",
    "import torchvision\n",
    "from torchvision.models.detection.faster_rcnn import FastRCNNPredictor"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-01-10T06:28:29.692266Z",
     "iopub.status.busy": "2023-01-10T06:28:29.691749Z",
     "iopub.status.idle": "2023-01-10T06:29:48.004555Z",
     "shell.execute_reply": "2023-01-10T06:29:48.003203Z",
     "shell.execute_reply.started": "2023-01-10T06:28:29.692235Z"
    }
   },
   "outputs": [],
   "source": [
    "! pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'\n",
    "! git clone https://github.com/pytorch/vision.git\n",
    "! cd vision;cp references/detection/utils.py ../;cp references/detection/transforms.py ../;cp references/detection/coco_eval.py ../;cp references/detection/engine.py ../;cp references/detection/coco_utils.py ../"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-01-10T06:30:28.868123Z",
     "iopub.status.busy": "2023-01-10T06:30:28.867452Z",
     "iopub.status.idle": "2023-01-10T06:30:41.051748Z",
     "shell.execute_reply": "2023-01-10T06:30:41.050718Z",
     "shell.execute_reply.started": "2023-01-10T06:30:28.868088Z"
    }
   },
   "outputs": [],
   "source": [
    "# 加载模型，使用COCO数据集预训练\n",
    "model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained='coco')\n",
    "\n",
    "num_classes = 21  \n",
    "\n",
    "# 得到输入数据的维度\n",
    "in_features = model.roi_heads.box_predictor.cls_score.in_features\n",
    "\n",
    "# 将模型的头部网络替换成另一个\n",
    "model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)\n",
    "\n",
    "device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')\n",
    "\n",
    "model.to(device)\n",
    "params = [p for p in model.parameters() if p.requires_grad]\n",
    "optimizer = torch.optim.SGD(params, lr=0.001, weight_decay=0.0005)\n",
    "lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)"
   ]
  },
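  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面配置的StepLR调度器每经过step_size个epoch就将学习率乘以gamma。其衰减规律可以用下面的极简算式示意（lr_at_epoch为本例自拟的函数名，仅复现StepLR的衰减公式）：\n",
    "\n",
    "```python\n",
    "def lr_at_epoch(epoch, base_lr=0.001, step_size=5, gamma=0.1):\n",
    "    # StepLR：每经过step_size个epoch，学习率乘以一次gamma\n",
    "    return base_lr * gamma ** (epoch // step_size)\n",
    "\n",
    "print([round(lr_at_epoch(e), 6) for e in [0, 4, 5, 10]])\n",
    "# [0.001, 0.001, 0.0001, 1e-05]\n",
    "```"
   ]
  },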
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-01-10T06:31:17.554101Z",
     "iopub.status.busy": "2023-01-10T06:31:17.553397Z",
     "iopub.status.idle": "2023-01-10T06:31:20.117321Z",
     "shell.execute_reply": "2023-01-10T06:31:20.116359Z",
     "shell.execute_reply.started": "2023-01-10T06:31:17.554066Z"
    }
   },
   "outputs": [],
   "source": [
    "WEIGHTS_FILE = \"../input/fasterrcnn/faster_rcnn_state.pth\"\n",
    "model.load_state_dict(torch.load(WEIGHTS_FILE))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们再编写绘制模型检测框的代码，并在ImageNet数据集上评估效果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-01-10T06:31:22.873928Z",
     "iopub.status.busy": "2023-01-10T06:31:22.873574Z",
     "iopub.status.idle": "2023-01-10T06:31:22.884847Z",
     "shell.execute_reply": "2023-01-10T06:31:22.883595Z",
     "shell.execute_reply.started": "2023-01-10T06:31:22.873899Z"
    }
   },
   "outputs": [],
   "source": [
    "# 绘制检测框 \n",
    "def obj_detector(img):\n",
    "    img = cv2.imread(img, cv2.IMREAD_COLOR)\n",
    "    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32)\n",
    "\n",
    "    # 导入图像并对图像进行处理\n",
    "    img /= 255.0\n",
    "    img = torch.from_numpy(img)\n",
    "    img = img.unsqueeze(0)\n",
    "    img = img.permute(0,3,1,2)\n",
    "    \n",
    "    model.eval()\n",
    "    \n",
    "    # 设置阈值\n",
    "    detection_threshold = 0.70\n",
    "    \n",
    "    img = list(im.to(device) for im in img)\n",
    "    output = model(img)\n",
    "\n",
    "    for i , im in enumerate(img):\n",
    "        boxes = output[i]['boxes'].data.cpu().numpy()\n",
    "        scores = output[i]['scores'].data.cpu().numpy()\n",
    "        labels = output[i]['labels'].data.cpu().numpy()\n",
    "        \n",
    "        # 获得标签、预测框等信息\n",
    "        labels = labels[scores >= detection_threshold]\n",
    "        boxes = boxes[scores >= detection_threshold].astype(np.int32)\n",
    "        scores = scores[scores >= detection_threshold]\n",
    "\n",
    "        boxes[:, 2] = boxes[:, 2] - boxes[:, 0]\n",
    "        boxes[:, 3] = boxes[:, 3] - boxes[:, 1]\n",
    "    \n",
    "    sample = img[0].permute(1,2,0).cpu().numpy()\n",
    "    sample = np.array(sample)\n",
    "    \n",
    "    boxes = output[0]['boxes'].data.cpu().numpy()\n",
    "    name = output[0]['labels'].data.cpu().numpy()\n",
    "    scores = output[0]['scores'].data.cpu().numpy()\n",
    "    \n",
    "    boxes = boxes[scores >= detection_threshold].astype(np.int32)\n",
    "    names = name.tolist()\n",
    "    \n",
    "    return names, boxes, sample"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们在ImageNet上检测模型的效果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-01-10T06:31:25.520032Z",
     "iopub.status.busy": "2023-01-10T06:31:25.519315Z",
     "iopub.status.idle": "2023-01-10T06:31:36.838983Z",
     "shell.execute_reply": "2023-01-10T06:31:36.837240Z",
     "shell.execute_reply.started": "2023-01-10T06:31:25.519996Z"
    }
   },
   "outputs": [],
   "source": [
    "pred_path = \"../input/imagenet/imagenet/val/\"\n",
    "pred_files = [os.path.join(pred_path,f) for f in os.listdir(pred_path)]\n",
    "\n",
    "classes= {1:'aeroplane',2:'bicycle',3:'bird',4:'boat',5:'bottle',6:'bus',7:'car',\n",
    "          8:'cat',9:'chair',10:'cow',11:'diningtable',12:'dog',13:'horse',14:'motorbike',\n",
    "          15:'person',16:'pottedplant',17:'sheep',18:'sofa',19:'train',20:'tvmonitor'}\n",
    "\n",
    "plt.figure(figsize=(20,60))\n",
    "for i, images in enumerate(pred_files):\n",
    "    if i > 19:\n",
    "        break\n",
    "        \n",
    "    plt.subplot(10,2,i+1)\n",
    "    names, boxes, sample = obj_detector(images)\n",
    "    \n",
    "    for i,box in enumerate(boxes):\n",
    "        # 绘制预测框\n",
    "        cv2.rectangle(sample, (box[0], box[1]), (box[2], box[3]), (0, 220, 0), 2)\n",
    "        cv2.putText(sample, classes[names[i]], (box[0],box[1]-5),cv2.FONT_HERSHEY_COMPLEX ,0.7,(220,0,0),1,cv2.LINE_AA)  \n",
    "\n",
    "    plt.axis('off')\n",
    "    plt.imshow(sample)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12.5 小结"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这一章中，我们学习了目标检测的基本原理，详细介绍了深度学习时代最具有代表性的目标检测模型——R-CNN系列模型及它们的发展脉络，并动手实现了相应的功能。作为计算机视觉的最重要的基本任务之一，目标检测有着广泛的工业应用场景，如汽车产业的辅助驾驶系统。接下来，我们将学习图像语义理解中的另一个重要任务——实例分割。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 12.6 习题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 习题12.1：请结合R-CNN、Fast R-CNN、Faster R-CNN的技术特点，探讨它们在目标检测任务中的优劣势，并给出你的个人看法。\n",
    "\n",
    "#### 习题12.2：在目标检测任务中，如果将分类网络和回归网络的输出融合在一起，是否会对性能有所提升？为什么？请详细阐述。\n",
    "\n",
    "#### 习题12.3：Faster R-CNN 采用 RPN 生成候选区域，那么 RPN 网络是如何生成不同大小、长宽比的锚框的？它又是如何通过训练确定这些锚框的位置和大小的？\n",
    "\n",
    "#### 习题12.4：请简述目标检测任务中的AP指标（Average Precision）及其计算方法。\n",
    "\n",
    "#### 习题12.5：目标检测任务中，如何解决目标漏检问题？请给出至少两种常见的解决方案，并分别进行说明。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 参考文献"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[1] Ross B. Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014: 580-587\n",
    "\n",
    "[2] Bogdan Alexe, Thomas Deselaers, Vittorio Ferrari: What is an object? CVPR 2010: 73-80\n",
    "\n",
    "[3] Ross B. Girshick: Fast R-CNN. ICCV 2015: 1440-1448\n",
    "\n",
    "[4] Shaoqing Ren, Kaiming He, Ross B. Girshick, Jian Sun: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015: 91-99\n",
    "\n",
    "[5] Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, Arnold W. M. Smeulders: Selective Search for Object Recognition. Int. J. Comput. Vis. 104(2): 154-171 (2013)\n",
    "\n",
    "[6] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, Alexander C. Berg: SSD: Single Shot MultiBox Detector. ECCV (1) 2016: 21-37\n",
    "\n",
    "[7] Joseph Redmon, Santosh Kumar Divvala, Ross B. Girshick, Ali Farhadi: You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016: 779-788\n",
    "\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
